Abstract

In applications throughout science and engineering one is often faced with the challenge of 
solving an ill-posed inverse problem, where the number of available measurements is smaller 
than the dimension of the model to be estimated. However in many practical situations of 
interest, models are constrained structurally so that they only have a few degrees of freedom 
relative to their ambient dimension. This paper provides a general framework to convert notions 
of simplicity into convex penalty functions, resulting in convex optimization solutions to linear, 
underdetermined inverse problems. The class of simple models considered are those formed as 
the sum of a few atoms from some (possibly infinite) elementary atomic set; examples include 
well-studied cases such as sparse vectors (e.g., signal processing, statistics) and low-rank
matrices (e.g., control, statistics), as well as several others including sums of a few permutation
matrices (e.g., ranked elections, multi-object tracking), low-rank tensors (e.g., computer vision,
neuroscience), orthogonal matrices (e.g., machine learning), and atomic measures (e.g., system
identification). The convex programming formulation is based on minimizing the norm induced
by the convex hull of the atomic set; this norm is referred to as the atomic norm. The facial
structure of the atomic norm ball carries a number of favorable properties that are useful for
recovering simple models, and an analysis of the underlying convex geometry provides sharp 
estimates of the number of generic measurements required for exact and robust recovery of 
models from partial information. These estimates are based on computing the Gaussian widths 
of tangent cones to the atomic norm ball. When the atomic set has algebraic structure the 
resulting optimization problems can be solved or approximated via semidefinite programming. 
The quality of these approximations affects the number of measurements required for recovery, 
and this tradeoff is characterized via some examples. Thus this work extends the catalog of simple
models (beyond sparse vectors and low-rank matrices) that can be recovered from limited
linear information via tractable convex programming. 

Keywords: Convex optimization; semidefinite programming; atomic norms; real algebraic 
geometry; Gaussian width; symmetry. 



Email: {venkatc,parrilo,willsky}@mit.edu; brecht@cs.wisc.edu. This work was supported in part by AFOSR
grant FA9550-08-1-0180, in part by a MURI through ARO grant W911NF-06-1-0076, in part by a MURI through
AFOSR grant FA9550-06-1-0303, in part by NSF FRG 0757207, in part through ONR award N00014-11-1-0723, and
NSF award CCF-1139953. 






1 Introduction 



Deducing the state or structure of a system from partial, noisy measurements is a fundamental task 
throughout the sciences and engineering. A commonly encountered difficulty that arises in such 
inverse problems is the limited availability of data relative to the ambient dimension of the signal 
to be estimated. However many interesting signals or models in practice contain few degrees of 
freedom relative to their ambient dimension. For instance a small number of genes may constitute 
a signature for disease, very few parameters may be required to specify the correlation structure in 
a time series, or a sparse collection of geometric constraints might completely specify a molecular 
configuration. Such low-dimensional structure plays an important role in making inverse problems 
well-posed. In this paper we propose a unified approach to transform notions of simplicity into 
convex penalty functions, thus obtaining convex optimization formulations for inverse problems. 

We describe a model as simple if it can be written as a nonnegative combination of a few
elements from an atomic set. Concretely let $x \in \mathbb{R}^p$ be formed as follows:

$$x = \sum_{i=1}^{k} c_i a_i, \qquad a_i \in \mathcal{A}, \; c_i \geq 0, \tag{1}$$

where $\mathcal{A}$ is a set of atoms that constitute simple building blocks of general signals. Here we
assume that $x$ is simple so that $k$ is relatively small. For example $\mathcal{A}$ could be the finite set of
unit-norm one-sparse vectors in which case $x$ is a sparse vector, or $\mathcal{A}$ could be the infinite set
of unit-norm rank-one matrices in which case $x$ is a low-rank matrix. These two cases arise in
many applications, and have received a tremendous amount of attention recently as several authors
have shown that sparse vectors and low-rank matrices can be recovered from highly incomplete
information [16, 26, 27, 62, 17]. However a number of other structured mathematical objects also
fit the notion of simplicity described in (1). The set $\mathcal{A}$ could be the collection of unit-norm
rank-one tensors, in which case $x$ is a low-rank tensor and we are faced with the familiar challenge of
low-rank tensor decomposition. Such problems arise in numerous applications in computer vision
and image processing [1], and in neuroscience [5]. Alternatively $\mathcal{A}$ could be the set of permutation
matrices; sums of a few permutation matrices are objects of interest in ranking [42] and multi-object
tracking. As yet another example, $\mathcal{A}$ could consist of measures supported at a single point so that
$x$ is an atomic measure supported at just a few points. This notion of simplicity arises in problems
in system identification and statistics.

In each of these examples as well as several others, a fundamental problem of interest is to recover
$x$ given limited linear measurements. For instance the question of recovering a sparse function over
the group of permutations (i.e., the sum of a few permutation matrices) given linear measurements
in the form of partial Fourier information was investigated in the context of ranked election problems
[42]. Similar linear inverse problems arise with atomic measures in system identification, with
orthogonal matrices in machine learning, and with simple models formed from several other atomic
sets (see Section 2.2 for more examples). Hence we seek tractable computational tools to solve
such problems. When $\mathcal{A}$ is the collection of one-sparse vectors, a method of choice is to use the
$\ell_1$ norm to induce sparse solutions. This method has seen a surge in interest in the last few years as it
provides a tractable convex optimization formulation to exactly recover sparse vectors under various
conditions [16, 26, 27]. More recently the nuclear norm has been proposed as an effective convex
surrogate for solving rank minimization problems subject to various affine constraints [62, 17].

Motivated by the success of these methods we propose a general convex optimization framework
in Section 2 in order to recover objects with structure of the form (1) from limited linear
measurements. The guiding question behind our framework is: how do we take a concept of simplicity
such as sparsity and derive the $\ell_1$ norm as a convex heuristic? In other words what is the
natural procedure to go from the set of one-sparse vectors $\mathcal{A}$ to the $\ell_1$ norm? We observe that
the convex hull of (unit-Euclidean-norm) one-sparse vectors is the unit ball of the $\ell_1$ norm, or the
cross-polytope. Similarly the convex hull of the (unit-Euclidean-norm) rank-one matrices is the
nuclear norm ball; see Figure 1 for illustrations. These constructions suggest a natural generalization
to other settings. Under suitable conditions the convex hull $\mathrm{conv}(\mathcal{A})$ defines the unit ball of
a norm, which is called the atomic norm induced by the atomic set $\mathcal{A}$. We can then minimize the
atomic norm subject to measurement constraints, which results in a convex programming heuristic
for recovering simple models given linear measurements. As an example suppose we wish to recover
the sum of a few permutation matrices given linear measurements. The convex hull of the set of
permutation matrices is the Birkhoff polytope of doubly stochastic matrices [73], and our proposal
is to solve a convex program that minimizes the norm induced by this polytope. Similarly if we
wish to recover an orthogonal matrix from linear measurements we would solve a spectral norm
minimization problem, as the spectral norm ball is the convex hull of all orthogonal matrices. As
discussed in Section 2.5 the atomic norm minimization problem is, in some sense, the best convex
heuristic for recovering simple models with respect to a given atomic set.

Figure 1: Unit balls of some atomic norms: In each figure, the set of atoms is graphed in red and
the unit ball of the associated atomic norm is graphed in blue. In (a), the atoms are the
unit-Euclidean-norm one-sparse vectors, and the atomic norm is the $\ell_1$ norm. In (b), the atoms are the
$2 \times 2$ symmetric unit-Euclidean-norm rank-one matrices, and the atomic norm is the nuclear norm.
In (c), the atoms are the vectors $\{-1, +1\}^2$, and the atomic norm is the $\ell_\infty$ norm.

We give general conditions for exact and robust recovery using the atomic norm heuristic. In 
Section 3 we provide concrete bounds on the number of generic linear measurements required for
the atomic norm heuristic to succeed. This analysis is based on computing certain Gaussian widths
of tangent cones with respect to the unit balls of the atomic norm [37]. Arguments based on Gaussian
width have been fruitfully applied to obtain bounds on the number of Gaussian measurements
for the special case of recovering sparse vectors via $\ell_1$ norm minimization [64, 67], but computing
Gaussian widths of general cones is not easy. Therefore it is important to exploit the special structure
in atomic norms, while still obtaining sufficiently general results that are broadly applicable.
An important theme in this paper is the connection between Gaussian widths and various notions
of symmetry. Specifically by exploiting symmetry structure in certain atomic norms as well as convex
duality properties, we give bounds on the number of measurements required for recovery using
very general atomic norm heuristics. For example we provide precise estimates of the number of
generic measurements required for exact recovery of an orthogonal matrix via spectral norm
minimization, and the number of generic measurements required for exact recovery of a permutation
matrix by minimizing the norm induced by the Birkhoff polytope. While these results correspond
to the recovery of individual atoms from random measurements, our techniques are more generally
applicable to the recovery of models formed as sums of a few atoms as well. We also give tighter
bounds than those previously obtained on the number of measurements required to robustly recover
sparse vectors and low-rank matrices via $\ell_1$ norm and nuclear norm minimization. In all of the
cases we investigate, we find that the number of measurements required to reconstruct an object
is proportional to its intrinsic dimension rather than the ambient dimension, thus confirming prior
folklore. See Table 1 for a summary of these results.

| Underlying model                     | Convex heuristic                  | # Gaussian measurements |
|--------------------------------------|-----------------------------------|-------------------------|
| $s$-sparse vector in $\mathbb{R}^p$  | $\ell_1$ norm                     | $2s\log(p/s) + 5s/4$    |
| $m \times m$ rank-$r$ matrix         | nuclear norm                      | $3r(2m - r)$            |
| sign-vector $\{-1,+1\}^p$            | $\ell_\infty$ norm                | $p/2$                   |
| $m \times m$ permutation matrix      | norm induced by Birkhoff polytope | $9m\log(m)$             |
| $m \times m$ orthogonal matrix       | spectral norm                     | $(3m^2 - m)/4$          |

Table 1: A summary of the recovery bounds obtained using Gaussian width arguments.
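The bounds summarized in Table 1 are closed-form expressions, so they can be evaluated directly to compare the number of measurements with the ambient dimension. The following is a minimal sketch, not part of the paper: the function names are hypothetical, and the logarithm is assumed to be the natural logarithm.

```python
import math

# Hypothetical helpers that evaluate the Table 1 bounds.
# Assumption: log denotes the natural logarithm.

def sparse_bound(s, p):
    """2 s log(p/s) + 5s/4 for an s-sparse vector in R^p."""
    return 2 * s * math.log(p / s) + 5 * s / 4

def low_rank_bound(r, m):
    """3 r (2m - r) for an m x m rank-r matrix."""
    return 3 * r * (2 * m - r)

def sign_vector_bound(p):
    """p/2 for a sign-vector in {-1, +1}^p."""
    return p / 2

def permutation_bound(m):
    """9 m log(m) for an m x m permutation matrix."""
    return 9 * m * math.log(m)

def orthogonal_bound(m):
    """(3 m^2 - m)/4 for an m x m orthogonal matrix."""
    return (3 * m * m - m) / 4

# Compare each bound with the ambient dimension of the model.
p, s = 10_000, 50
m, r = 100, 5
print("sparse:     ", sparse_bound(s, p), "of", p)
print("low rank:   ", low_rank_bound(r, m), "of", m * m)
print("permutation:", permutation_bound(m), "of", m * m)
print("orthogonal: ", orthogonal_bound(m), "of", m * m)
```

For the sparse and low-rank rows the bound scales with $s$ or $r$ and is far below the ambient dimension when the model is simple, which is the point the paragraph above makes.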

Although our conditions for recovery and bounds on the number of measurements hold generally,
we note that it may not be possible to obtain a computable representation for the convex hull
$\mathrm{conv}(\mathcal{A})$ of an arbitrary set of atoms $\mathcal{A}$. This leads us to another important theme of this paper,
which we discuss in Section 4, on the connection between algebraic structure in $\mathcal{A}$ and the semidefinite
representability of the convex hull $\mathrm{conv}(\mathcal{A})$. In particular when $\mathcal{A}$ is an algebraic variety the
convex hull $\mathrm{conv}(\mathcal{A})$ can be approximated as (the projection of) a set defined by linear matrix
inequalities. Thus the resulting atomic norm minimization heuristic can be solved via semidefinite
programming. A second issue that arises in practice is that even with algebraic structure in $\mathcal{A}$ the
semidefinite representation of $\mathrm{conv}(\mathcal{A})$ may not be computable in polynomial time, which makes the
atomic norm minimization problem intractable to solve. A prominent example here is the tensor
nuclear norm ball, obtained by taking the convex hull of the rank-one tensors. In order to address
this problem we give a hierarchy of semidefinite relaxations using theta bodies that approximate
the original (intractable) atomic norm minimization problem [38]. We also highlight that while
these semidefinite relaxations are more tractable to solve, we require more measurements for exact
recovery of the underlying model than if we solve the original intractable atomic norm minimization
problem. Hence there is a tradeoff between the complexity of the recovery algorithm and the
number of measurements required for recovery. We illustrate this tradeoff with the cut polytope
and its relaxations.

Outline. Section 2 describes the construction of the atomic norm, gives several examples of
applications in which these norms may be useful to recover simple models, and provides general
conditions for recovery by minimizing the atomic norm. In Section 3 we investigate the number
of generic measurements for exact or robust recovery using atomic norm minimization, and give
estimates in a number of settings by analyzing the Gaussian width of certain tangent cones. We
address the problem of semidefinite representability and tractable relaxations of the atomic norm
in Section 4. Section 5 describes some algorithmic issues as well as a few simulation results, and
we conclude with a discussion and open questions in Section 6.

2 Atomic Norms and Convex Geometry 

In this section we describe the construction of an atomic norm from a collection of simple atoms. 
In addition we give several examples of atomic norms, and discuss their properties in the context 
of solving ill-posed linear inverse problems. We denote the Euclidean norm by $\|\cdot\|$.



2.1 Definition 

Let $\mathcal{A}$ be a collection of atoms that is a compact subset of $\mathbb{R}^p$. We will assume throughout this
paper that no element $a \in \mathcal{A}$ lies in the convex hull of the other elements $\mathrm{conv}(\mathcal{A} \setminus \{a\})$, i.e., the
elements of $\mathcal{A}$ are the extreme points of $\mathrm{conv}(\mathcal{A})$. Let $\|x\|_{\mathcal{A}}$ denote the gauge of $\mathcal{A}$ [63]:

$$\|x\|_{\mathcal{A}} = \inf\{t > 0 : x \in t\,\mathrm{conv}(\mathcal{A})\}. \tag{2}$$

Note that the gauge is always a convex, extended-real valued function for any set $\mathcal{A}$. By convention
this function evaluates to $+\infty$ if $x$ does not lie in the affine hull of $\mathrm{conv}(\mathcal{A})$. We will assume
without loss of generality that the centroid of $\mathrm{conv}(\mathcal{A})$ is at the origin, as this can be achieved by
appropriate recentering. With this assumption the gauge function can be rewritten as [10]:

$$\|x\|_{\mathcal{A}} = \inf\Big\{ \sum_{a \in \mathcal{A}} c_a \;:\; x = \sum_{a \in \mathcal{A}} c_a a, \;\; c_a \geq 0 \;\; \forall a \in \mathcal{A} \Big\}.$$

If $\mathcal{A}$ is centrally symmetric about the origin (i.e., $a \in \mathcal{A}$ if and only if $-a \in \mathcal{A}$) we have that $\|\cdot\|_{\mathcal{A}}$
is a norm, which we call the atomic norm induced by $\mathcal{A}$. The support function of $\mathcal{A}$ is given as:

$$\|x\|_{\mathcal{A}}^* = \sup\{\langle x, a \rangle : a \in \mathcal{A}\}. \tag{3}$$

If $\|\cdot\|_{\mathcal{A}}$ is a norm the support function $\|\cdot\|_{\mathcal{A}}^*$ is the dual norm of this atomic norm. From this
definition we see that the unit ball of $\|\cdot\|_{\mathcal{A}}$ is equal to $\mathrm{conv}(\mathcal{A})$. In many examples of interest
the set $\mathcal{A}$ is not centrally symmetric, so that the gauge function does not define a norm. However
our analysis is based on the underlying convex geometry of $\mathrm{conv}(\mathcal{A})$, and our results are applicable
even if $\|\cdot\|_{\mathcal{A}}$ does not define a norm. Therefore, with an abuse of terminology we generally refer
to $\|\cdot\|_{\mathcal{A}}$ as the atomic norm of the set $\mathcal{A}$ even if $\|\cdot\|_{\mathcal{A}}$ is not a norm. We note that the duality
characterization between (2) and (3) when $\|\cdot\|_{\mathcal{A}}$ is a norm is in fact applicable even in
infinite-dimensional Banach spaces by Bonsall's atomic decomposition theorem [10], but our focus is on the
finite-dimensional case in this work. We investigate in greater detail the issues of representability
and efficient approximation of these atomic norms in Section 4.
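For a finite atomic set the support function in (3) is a maximum over finitely many inner products, so it can be evaluated by brute force. The sketch below is an illustration under that assumption (the helper names are hypothetical). It checks two standard dual pairs: the one-sparse atoms $\pm e_i$ give the $\ell_\infty$ norm as support function, and the sign vectors give the $\ell_1$ norm.

```python
from itertools import product

def one_sparse_atoms(p):
    """Atoms {+e_i, -e_i} in R^p, represented as tuples."""
    atoms = []
    for i in range(p):
        for sign in (1, -1):
            atoms.append(tuple(sign if j == i else 0 for j in range(p)))
    return atoms

def sign_atoms(p):
    """All 2^p sign vectors in {-1, +1}^p."""
    return list(product((-1, 1), repeat=p))

def support_function(x, atoms):
    """sup{<x, a> : a in atoms} for a finite atomic set."""
    return max(sum(xi * ai for xi, ai in zip(x, a)) for a in atoms)

x = (3, -1, 2)
# Dual of the l1 atomic norm: the l_infinity norm of x.
print(support_function(x, one_sparse_atoms(3)), max(abs(v) for v in x))
# Dual of the l_infinity atomic norm: the l1 norm of x.
print(support_function(x, sign_atoms(3)), sum(abs(v) for v in x))
```

Brute-force enumeration is only practical for small finite atomic sets; for infinite sets such as rank-one matrices the supremum has to be computed by other means.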

Equipped with a convex penalty function given a set of atoms, we propose a convex optimization
method to recover a "simple" model given limited linear measurements. Specifically suppose that
$x^\star$ is formed according to (1) from a set of atoms $\mathcal{A}$. Further suppose that we have a known linear
map $\Phi : \mathbb{R}^p \to \mathbb{R}^n$, and we have linear information about $x^\star$ as follows:

$$y = \Phi x^\star. \tag{4}$$

The goal is to reconstruct $x^\star$ given $y$. We consider the following convex formulation to accomplish
this task:

$$\hat{x} = \arg\min_{x} \|x\|_{\mathcal{A}} \quad \text{s.t.} \quad y = \Phi x. \tag{5}$$

When $\mathcal{A}$ is the set of one-sparse atoms this problem reduces to standard $\ell_1$ norm minimization.
Similarly when $\mathcal{A}$ is the set of rank-one matrices this problem reduces to nuclear norm minimization.
More generally if the atomic norm $\|\cdot\|_{\mathcal{A}}$ is tractable to evaluate, then (5) potentially offers an efficient
convex programming formulation for reconstructing $x^\star$ from the limited information $y$. The dual
problem of (5) is given as follows:

$$\max_{z} \; y^T z \quad \text{s.t.} \quad \|\Phi^\dagger z\|_{\mathcal{A}}^* \leq 1. \tag{6}$$

Here $\Phi^\dagger$ denotes the adjoint (or transpose) of the linear measurement map $\Phi$.
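The primal and dual problems (5) and (6) can be used together to certify optimality: any dual-feasible $z$ gives a lower bound $y^T z$ on the atomic norm of every feasible $x$, so a feasible pair with equal objective values is optimal. The sketch below illustrates this for the $\ell_1$ case, whose dual norm is the $\ell_\infty$ norm. The matrix, the vector `x_star`, and the certificate `z` are toy values chosen for illustration, not taken from the paper.

```python
def l1_norm(v):
    return sum(abs(t) for t in v)

def linf_norm(v):
    return max(abs(t) for t in v)

def apply(Phi, x):
    """Compute Phi x for a matrix stored as a list of rows."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in Phi]

def apply_adjoint(Phi, z):
    """Compute Phi^T z."""
    return [sum(Phi[i][j] * z[i] for i in range(len(Phi)))
            for j in range(len(Phi[0]))]

# Toy instance (assumed for illustration): n = 2 measurements of a vector in R^3.
Phi = [[1.0, 0.0, 1.0],
       [0.0, 1.0, 1.0]]
x_star = [0.0, 0.0, 1.0]          # one-sparse model
y = apply(Phi, x_star)            # exact measurements y = Phi x_star

# Candidate dual certificate for problem (6).
z = [0.5, 0.5]
dual_feasible = linf_norm(apply_adjoint(Phi, z)) <= 1.0
dual_value = sum(yi * zi for yi, zi in zip(y, z))
primal_value = l1_norm(x_star)

# Weak duality gives dual_value <= ||x||_1 for every feasible x, so equality
# of the two values certifies that x_star solves (5).
print(dual_feasible, dual_value, primal_value)
```

The certificate shows optimality only; uniqueness of the solution is the subject of the recovery condition in Section 2.4.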

The convex formulation (5) can be suitably modified in case we only have access to inaccurate,
noisy information. Specifically suppose that we have noisy measurements $y = \Phi x^\star + \omega$ where $\omega$
represents the noise term. A natural convex formulation is one in which the constraint $y = \Phi x$ of
(5) is replaced by the relaxed constraint $\|y - \Phi x\| \leq \delta$, where $\delta$ is an upper bound on the size of
the noise $\omega$:

$$\hat{x} = \arg\min_{x} \|x\|_{\mathcal{A}} \quad \text{s.t.} \quad \|y - \Phi x\| \leq \delta. \tag{7}$$

We say that we have exact recovery in the noise-free case if $\hat{x} = x^\star$ in (5), and robust recovery in
the noisy case if the error $\|\hat{x} - x^\star\|$ is small in (7). In Section 2.4 and Section 3 we give conditions
under which the atomic norm heuristics (5) and (7) recover $x^\star$ exactly or approximately. Atomic
norms have found fruitful applications in problems in approximation theory of various function 
classes [581 143| El |24] . However this prior body of work was concerned with infinite-dimensional 
Banach spaces, and none of these references consider nor provide recovery guarantees that are 
applicable in our setting. 

2.2 Examples 

Next we provide several examples of atomic norms that can be viewed as special cases of the 
construction above. These norms are obtained by convexifying atomic sets that are of interest in 
various applications. 

Sparse vectors. The problem of recovering sparse vectors from limited measurements has
received a great deal of attention, with applications in many problem domains. In this case the
atomic set $\mathcal{A} \subset \mathbb{R}^p$ can be viewed as the set of unit-norm one-sparse vectors $\{\pm e_i\}_{i=1}^{p}$, and $k$-sparse
vectors in $\mathbb{R}^p$ can be constructed using a linear combination of $k$ elements of the atomic set. In this
case it is easily seen that the convex hull $\mathrm{conv}(\mathcal{A})$ is given by the cross-polytope (i.e., the unit ball
of the $\ell_1$ norm), and the atomic norm $\|\cdot\|_{\mathcal{A}}$ corresponds to the $\ell_1$ norm in $\mathbb{R}^p$.
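To make the sparse-vector case concrete, the sketch below writes a vector as a nonnegative combination of the atoms $\pm e_i$ and checks that the coefficients sum to its $\ell_1$ norm. This exhibits one feasible decomposition in the gauge representation of Section 2.1, so it shows that the atomic norm is at most the $\ell_1$ norm; that the two coincide is the standard fact stated above. The function name is hypothetical.

```python
def decompose_one_sparse(x):
    """Return a list of (c, atom) pairs with c > 0 and atom = +e_i or -e_i
    such that the weighted sum of the atoms reproduces x."""
    p = len(x)
    terms = []
    for i, xi in enumerate(x):
        if xi != 0:
            sign = 1 if xi > 0 else -1
            atom = tuple(sign if j == i else 0 for j in range(p))
            terms.append((abs(xi), atom))
    return terms

x = (0, -2.5, 0, 4, 0)
terms = decompose_one_sparse(x)

# Rebuild x from the atoms and add up the nonnegative coefficients.
reconstruction = tuple(sum(c * a[j] for c, a in terms) for j in range(len(x)))
coefficient_sum = sum(c for c, _ in terms)

print(len(terms), "atoms used")
print(reconstruction == x, coefficient_sum == sum(abs(v) for v in x))
```

The number of atoms used equals the number of nonzero entries, which matches the notion of a $k$-sparse vector as a combination of $k$ atoms.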

Low-rank matrices. Recovering low-rank matrices from limited information is also a problem
that has received considerable attention as it finds applications in problems in statistics, control,
and machine learning. The atomic set $\mathcal{A}$ here can be viewed as the set of rank-one matrices of
unit-Euclidean-norm. The convex hull $\mathrm{conv}(\mathcal{A})$ is the nuclear norm ball of matrices in which the
sum of the singular values is less than or equal to one.

Sparse and low-rank matrices. The problem of recovering a sparse matrix and a low-rank
matrix given information about their sum arises in a number of model selection and system
identification settings. The corresponding atomic norm is constructed by taking the convex hull of
an atomic set obtained via the union of rank-one matrices and (suitably scaled) one-sparse matrices.
This norm can also be viewed as the infimal convolution of the $\ell_1$ norm and the nuclear norm, and
its properties have been explored in [T9| [Ti"]. 

Permutation matrices. A problem of interest in a ranking context [22] or an object tracking
context is that of recovering permutation matrices from partial information. Suppose that a small
number $k$ of rankings of $m$ candidates is preferred by a population. Such preferences can be
modeled as the sum of a few $m \times m$ permutation matrices, with each permutation corresponding
to a particular ranking. By conducting surveys of the population one can obtain partial linear
information of these preferred rankings. The set $\mathcal{A}$ here is the collection of permutation matrices
(consisting of $m!$ elements), and the convex hull $\mathrm{conv}(\mathcal{A})$ is the Birkhoff polytope or the set of
doubly stochastic matrices [73]. The centroid of the Birkhoff polytope is the matrix $\mathbf{1}\mathbf{1}^T/m$, so
it needs to be recentered appropriately. We mention here recent work by Jagabathula and Shah
[42] on recovering a sparse function over the symmetric group (i.e., the sum of a few permutation
matrices) given partial Fourier information; although the algorithm proposed in [42] is tractable it
is not based on convex optimization.
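The recentering of the Birkhoff polytope can be checked by brute force for small $m$: averaging all $m!$ permutation matrices gives the matrix with every entry equal to $1/m$, i.e. $\mathbf{1}\mathbf{1}^T/m$. The sketch below is an illustration with hypothetical helper names; it uses exact rational arithmetic and is only feasible for small $m$.

```python
from fractions import Fraction
from itertools import permutations

def permutation_matrix(perm):
    """m x m matrix with a 1 in row i, column perm[i]."""
    m = len(perm)
    return [[1 if perm[i] == j else 0 for j in range(m)] for i in range(m)]

def birkhoff_centroid(m):
    """Average of all m! permutation matrices, computed exactly."""
    mats = [permutation_matrix(perm) for perm in permutations(range(m))]
    count = len(mats)
    return [[Fraction(sum(M[i][j] for M in mats), count) for j in range(m)]
            for i in range(m)]

m = 4
centroid = birkhoff_centroid(m)

# Every entry should equal 1/m, so the recentered atoms P - 11^T/m average to zero.
print(all(entry == Fraction(1, m) for row in centroid for entry in row))
```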

Binary vectors. In integer programming one is often interested in recovering vectors in which
the entries take on values of $\pm 1$. Suppose that there exists such a sign-vector, and we wish to
recover this vector given linear measurements. This corresponds to a version of the multi-knapsack
problem [51]. In this case $\mathcal{A}$ is the set of all sign-vectors, and the convex hull $\mathrm{conv}(\mathcal{A})$ is the
hypercube or the unit ball of the $\ell_\infty$ norm. The image of this hypercube under a linear map is also
referred to as a zonotope [73].

Vectors from lists. Suppose there is an unknown vector $x \in \mathbb{R}^p$, and that we are given the
entries of this vector without any information about the locations of these entries. For example if
$x = [3 \; 1 \; 2 \; 2 \; 4]^T$, then we are only given the list of numbers $\{1, 2, 2, 3, 4\}$ without their positions in
$x$. Further suppose that we have access to a few linear measurements of $x$. Can we recover $x$ by
solving a convex program? Such a problem is of interest in recovering partial rankings of elements
of a set. An extreme case is one in which we only have two preferences for rankings, i.e., a vector
in $\{1,2\}^p$ composed only of ones and twos, which reduces to a special case of the problem above
of recovering binary vectors (in which the number of entries of each sign is fixed). For this problem
the set $\mathcal{A}$ is the set of all permutations of $x$ (which we know since we have the list of numbers
that compose $x$), and the convex hull $\mathrm{conv}(\mathcal{A})$ is the permutahedron [73, 65]. As with the Birkhoff
polytope, the permutahedron also needs to be recentered about the point $(\mathbf{1}^T x/p)\,\mathbf{1}$, the vector whose
entries all equal the mean of the entries of $x$.
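The center of the permutahedron can be verified on the example list above. The sketch below (with a hypothetical helper name) averages all distinct rearrangements of $x = [3\;1\;2\;2\;4]^T$ using exact rational arithmetic and compares each coordinate of the result with the mean of the entries.

```python
from fractions import Fraction
from itertools import permutations

def permutahedron_center(x):
    """Average of all distinct rearrangements of x (the atoms), computed exactly."""
    atoms = set(permutations(x))   # repeated entries give repeated tuples; keep one each
    count = len(atoms)
    return [Fraction(sum(a[j] for a in atoms), count) for j in range(len(x))]

x = (3, 1, 2, 2, 4)
center = permutahedron_center(x)
mean_entry = Fraction(sum(x), len(x))

# Each coordinate of the center equals the mean of the entries of x.
print(center == [mean_entry] * len(x))
```

Enumerating rearrangements grows factorially with the length of the list, so this check is meant only for small examples.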

Matrices constrained by eigenvalues. This problem is in a sense the non-commutative
analog of the one above. Suppose that we are given the eigenvalues $\lambda$ of a symmetric matrix, but
no information about the eigenvectors. Can we recover such a matrix given some additional linear
measurements? In this case the set $\mathcal{A}$ is the set of all symmetric matrices with eigenvalues $\lambda$, and
the convex hull $\mathrm{conv}(\mathcal{A})$ is given by the Schur-Horn orbitope [65].

Orthogonal matrices. In many applications matrix variables are constrained to be orthogonal,
which is a non-convex constraint and may lead to computational difficulties. We consider one
such simple setting in which we wish to recover an orthogonal matrix given limited information in
the form of linear measurements. In this example the set $\mathcal{A}$ is the set of $m \times m$ orthogonal matrices,
and $\mathrm{conv}(\mathcal{A})$ is the spectral norm ball.

Measures. Recovering a measure given its moments is another question of interest that arises in
system identification and statistics. Suppose one is given access to a linear combination of moments
of an atomically supported measure. How can we reconstruct the support of the measure? The
set $\mathcal{A}$ here is the moment curve, and its convex hull $\mathrm{conv}(\mathcal{A})$ goes by several names including the
Carathéodory orbitope [65]. Discretized versions of this problem correspond to the set $\mathcal{A}$ being a
finite number of points on the moment curve; the convex hull $\mathrm{conv}(\mathcal{A})$ is then a cyclic polytope [73].

Cut matrices. In some problems one may wish to recover low-rank matrices in which the
entries are constrained to take on values of $\pm 1$. Such matrices can be used to model basic user
preferences, and are of interest in problems such as collaborative filtering [66]. The set of atoms
$\mathcal{A}$ could be the set of rank-one signed matrices, i.e., matrices of the form $zz^T$ with the entries of $z$
being $\pm 1$. The convex hull $\mathrm{conv}(\mathcal{A})$ of such matrices is the cut polytope [22]. An interesting issue
that arises here is that the cut polytope is in general intractable to characterize. However there
exist several well-known tractable semidefinite relaxations to this polytope [23, 36], and one can
employ these in constructing efficient convex programs for recovering cut matrices. We discuss this
point in greater detail in Section 4.3.

Low-rank tensors. Low-rank tensor decompositions play an important role in numerous
applications throughout signal processing and machine learning [46]. Developing computational
tools to recover low-rank tensors is therefore of great interest. In principle we could solve a tensor
nuclear norm minimization problem, in which the tensor nuclear norm ball is obtained by taking
the convex hull of rank-one tensors. A computational challenge here is that the tensor nuclear norm
is in general intractable to compute; in order to address this problem we discuss further convex
relaxations to the tensor nuclear norm using theta bodies in Section 4. A number of additional
technical issues also arise with low-rank tensors including the non-existence in general of a singular
value decomposition analogous to that for matrices [35], and the difference between the rank of a
tensor and its border rank [23].

Nonorthogonal factor analysis. Suppose that a data matrix admits a factorization $X = AB$.
The matrix nuclear norm heuristic will find a factorization into orthogonal factors in which the
columns of $A$ and rows of $B$ are mutually orthogonal. However if a priori information is available
about the factors, precision and recall could be improved by enforcing such priors. These priors
may sacrifice orthogonality, but the factors might better conform with assumptions about how the
data are generated. For instance in some applications one might know in advance that the factors
should only take on a discrete set of values [66]. In this case, we might try to fit a sum of rank-one
matrices that are bounded in $\ell_\infty$ norm rather than in $\ell_2$ norm. Another prior that commonly arises
in practice is that the factors are non-negative (i.e., in non-negative matrix factorization). These
and other priors on the basic rank-one summands induce different norms on low-rank models than
the standard nuclear norm [33], and may be better suited to specific applications.

2.3 Background on Tangent and Normal Cones 

In order to properly state our results, we recall some basic concepts from convex analysis. A convex
set $C$ is a cone if it is closed under positive linear combinations. The polar $C^*$ of a cone $C$ is the
cone

$$C^* = \{x \in \mathbb{R}^p : \langle x, z \rangle \leq 0 \;\; \forall z \in C\}.$$

Given some nonzero $x \in \mathbb{R}^p$ we define the tangent cone at $x$ with respect to the scaled unit ball
$\|x\|_{\mathcal{A}}\,\mathrm{conv}(\mathcal{A})$ as

$$T_{\mathcal{A}}(x) = \mathrm{cone}\{z - x : \|z\|_{\mathcal{A}} \leq \|x\|_{\mathcal{A}}\}. \tag{8}$$

The cone $T_{\mathcal{A}}(x)$ is equal to the set of descent directions of the atomic norm $\|\cdot\|_{\mathcal{A}}$ at the point $x$,
i.e., the set of all directions $d$ such that the directional derivative is negative.

The normal cone $N_{\mathcal{A}}(x)$ at $x$ with respect to the scaled unit ball $\|x\|_{\mathcal{A}}\,\mathrm{conv}(\mathcal{A})$ is defined to be
the set of all directions $s$ that form obtuse angles with every descent direction of the atomic norm
$\|\cdot\|_{\mathcal{A}}$ at the point $x$:

$$N_{\mathcal{A}}(x) = \{s : \langle s, z - x \rangle \leq 0 \;\; \forall z \text{ s.t. } \|z\|_{\mathcal{A}} \leq \|x\|_{\mathcal{A}}\}. \tag{9}$$

The normal cone is equal to the set of all normal vectors $s$ of hyperplanes that
support the scaled unit ball $\|x\|_{\mathcal{A}}\,\mathrm{conv}(\mathcal{A})$ at $x$. Observe that the polar cone of the tangent cone
$T_{\mathcal{A}}(x)$ is the normal cone $N_{\mathcal{A}}(x)$ and vice-versa. Moreover we have the basic characterization that
the normal cone $N_{\mathcal{A}}(x)$ is the conic hull of the subdifferential of the atomic norm at $x$.

2.4 Recovery Condition 

The following result gives a characterization of the favorable underlying geometry required for exact
recovery. Let $\mathrm{null}(\Phi)$ denote the nullspace of the operator $\Phi$.

Proposition 2.1. We have that $\hat{x} = x^\star$ is the unique optimal solution of (5) if and only if
$\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^\star) = \{0\}$.

Proof. Eliminating the equality constraints in (5) we have the equivalent optimization problem

$$\min_{d} \; \|x^\star + d\|_{\mathcal{A}} \quad \text{s.t.} \quad d \in \mathrm{null}(\Phi).$$

Suppose $\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^\star) = \{0\}$. Since $\|x^\star + d\|_{\mathcal{A}} \leq \|x^\star\|_{\mathcal{A}}$ implies $d \in T_{\mathcal{A}}(x^\star)$, we have that
$\|x^\star + d\|_{\mathcal{A}} > \|x^\star\|_{\mathcal{A}}$ for all $d \in \mathrm{null}(\Phi) \setminus \{0\}$. Conversely $x^\star$ is the unique optimal solution of (5)
if $\|x^\star + d\|_{\mathcal{A}} > \|x^\star\|_{\mathcal{A}}$ for all $d \in \mathrm{null}(\Phi) \setminus \{0\}$, which implies that $d \notin T_{\mathcal{A}}(x^\star)$. □

Proposition 2.1 asserts that the atomic norm heuristic succeeds if the nullspace of the sampling
operator does not intersect the tangent cone $T_{\mathcal{A}}(x^\star)$ at $x^\star$ other than at the origin. In Section 3 we
provide a characterization of tangent cones that determines the number of Gaussian measurements
required to guarantee such an empty intersection.
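The condition of Proposition 2.1 can be checked by hand in a very small case. The sketch below is an illustration for the $\ell_1$ norm in $\mathbb{R}^2$ with a single measurement, so that $\mathrm{null}(\Phi)$ is spanned by one vector. It assumes the standard description of the $\ell_1$ tangent cone at $x$ as the set of directions $d$ whose one-sided directional derivative $\sum_{i: x_i \neq 0} \mathrm{sign}(x_i)\, d_i + \sum_{i: x_i = 0} |d_i|$ is at most zero; the function names and the two measurement vectors are hypothetical.

```python
def l1_directional_derivative(x, d):
    """One-sided directional derivative of the l1 norm at x along d."""
    total = 0.0
    for xi, di in zip(x, d):
        if xi > 0:
            total += di
        elif xi < 0:
            total -= di
        else:
            total += abs(di)
    return total

def in_l1_tangent_cone(x, d):
    """True if moving from x along d does not increase the l1 norm to first order."""
    return l1_directional_derivative(x, d) <= 0

def recovery_certified(x_star, null_vector):
    """Check null(Phi) ∩ T(x_star) = {0} when null(Phi) is spanned by one vector.
    Since the cone is closed under positive scaling, testing +v and -v suffices."""
    negated = [-t for t in null_vector]
    return not (in_l1_tangent_cone(x_star, null_vector)
                or in_l1_tangent_cone(x_star, negated))

x_star = (1.0, 0.0)
# Phi = [2, 1] has nullspace spanned by (1, -2): the condition holds.
print(recovery_certified(x_star, (1.0, -2.0)))
# Phi = [1, 2] has nullspace spanned by (2, -1): the condition fails,
# and indeed (0, 0.5) satisfies the same measurement with smaller l1 norm.
print(recovery_certified(x_star, (2.0, -1.0)))
```

With more than one free direction in the nullspace a finite check of this kind is no longer sufficient, which is why Section 3 resorts to Gaussian width arguments.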

A tightening of this empty intersection condition can also be used to address the noisy
approximation problem. The following proposition characterizes when $x^\star$ can be well-approximated using
the convex program (7).

Proposition 2.2. Suppose that we are given $n$ noisy measurements $y = \Phi x^\star + \omega$ where $\|\omega\| \leq \delta$,
and $\Phi : \mathbb{R}^p \to \mathbb{R}^n$. Let $\hat{x}$ denote an optimal solution of (7). Further suppose for all $z \in T_{\mathcal{A}}(x^\star)$
that we have $\|\Phi z\| \geq \epsilon \|z\|$. Then $\|\hat{x} - x^\star\| \leq \frac{2\delta}{\epsilon}$.

Proof. The set of descent directions at $x^\star$ with respect to the atomic norm ball is given by the
tangent cone $T_{\mathcal{A}}(x^\star)$. The error vector $\hat{x} - x^\star$ lies in $T_{\mathcal{A}}(x^\star)$ because $\hat{x}$ is a minimal atomic norm
solution, and hence $\|\hat{x}\|_{\mathcal{A}} \leq \|x^\star\|_{\mathcal{A}}$. It follows by the triangle inequality that

$$\|\Phi(\hat{x} - x^\star)\| \leq \|\Phi \hat{x} - y\| + \|\Phi x^\star - y\| \leq 2\delta. \tag{10}$$

By assumption we have that

$$\|\Phi(\hat{x} - x^\star)\| \geq \epsilon \|\hat{x} - x^\star\|, \tag{11}$$

which allows us to conclude that $\|\hat{x} - x^\star\| \leq \frac{2\delta}{\epsilon}$. □

Therefore, we need only concern ourselves with estimating the minimum value of $\frac{\|\Phi z\|}{\|z\|}$ for
nonzero $z \in T_{\mathcal{A}}(x^\star)$. We denote this quantity as the minimum gain of the measurement operator
$\Phi$ restricted to the cone $T_{\mathcal{A}}(x^\star)$. In particular if this minimum gain is bounded away from zero,
then the atomic norm heuristic also provides robust recovery when we have access to noisy linear
measurements of $x^\star$. Minimum gain conditions have been employed in recent recovery results via
$\ell_1$-norm minimization, block-sparse vector recovery, and low-rank matrix reconstruction [18, 8, T0, 53].
All of these results rely heavily on strong decomposability conditions of the $\ell_1$ norm and the matrix
nuclear norm. However, there are several examples of atomic norms (for instance, the $\ell_\infty$ norm
and the norm induced by the Birkhoff polytope) as specified in Section 2.2 that do not satisfy such
decomposability conditions. As we will see in the sequel the more geometric viewpoint adopted in this
paper provides a fruitful framework in which to analyze the recovery properties of general atomic
norms.
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2.5 Why Atomic Norm? 

The atomic norm induced by a set $\mathcal{A}$ possesses a number of favorable properties that are useful
for recovering "simple" models from limited linear measurements. The key point to note from
Section 2.4 is that the smaller the tangent cone at a point $x^\star$ with respect to
$\mathrm{conv}(\mathcal{A})$, the easier it is to satisfy the empty-intersection condition of Proposition 2.1.

Based on this observation it is desirable that points in $\mathrm{conv}(\mathcal{A})$ with smaller tangent
cones correspond to simpler models, while points in $\mathrm{conv}(\mathcal{A})$ with larger tangent cones
generally correspond to more complicated models. The construction of $\mathrm{conv}(\mathcal{A})$ by taking
the convex hull of $\mathcal{A}$ ensures that this is the case. The extreme points of $\mathrm{conv}(\mathcal{A})$
correspond to the simplest models, i.e., those models formed from a single element of $\mathcal{A}$.
Further, the low-dimensional faces of $\mathrm{conv}(\mathcal{A})$ consist of those elements that are obtained
by taking linear combinations of a few basic atoms from $\mathcal{A}$. These are precisely the properties
desired, as points lying in these low-dimensional faces of $\mathrm{conv}(\mathcal{A})$ have smaller tangent
cones than those lying on larger faces.

We also note that the atomic norm is, in some sense, the best possible convex heuristic for
recovering simple models. Any reasonable heuristic penalty function should be constant on the set
of atoms $\mathcal{A}$. This ensures that no atom is preferred over any other. Under this assumption, we
must have that for any $a \in \mathcal{A}$, $a' - a$ must be a descent direction for all $a' \in \mathcal{A}$. The
best convex penalty function is one in which the cones of descent directions at $a \in \mathcal{A}$ are as
small as possible. This is because, as described above, smaller cones are more likely to satisfy
the empty-intersection condition required for exact recovery. Since the tangent cone at
$a \in \mathcal{A}$ with respect to $\mathrm{conv}(\mathcal{A})$ is precisely the conic hull of $a' - a$ for
$a' \in \mathcal{A}$, the atomic norm is the best convex heuristic for recovering models where simplicity
is dictated by the set $\mathcal{A}$.

Our reasons for proposing the atomic norm as a useful convex heuristic are quite different from
previous justifications of the $\ell_1$ norm and the nuclear norm. In particular let
$f : \mathbb{R}^p \to \mathbb{R}$ denote the cardinality function that counts the number of nonzero entries of a
vector. Then the $\ell_1$ norm is the convex envelope of $f$ restricted to the unit ball of the
$\ell_\infty$ norm, i.e., the best convex underestimator of $f$ restricted to vectors in the
$\ell_\infty$-norm ball. This view of the $\ell_1$ norm in relation to the function $f$ is often given as
a justification for its effectiveness in recovering sparse vectors. However if we consider the
convex envelope of $f$ restricted to the Euclidean norm ball, then we obtain a very different
convex function than the $\ell_1$ norm! With more general atomic sets, it may not be clear a priori
what the bounding set should be in deriving the convex envelope. In contrast the viewpoint
adopted in this paper leads to a natural, unambiguous construction of the $\ell_1$ norm and other
general atomic norms. Further, as explained above, it is the favorable facial structure of the
atomic norm ball that makes the atomic norm a suitable convex heuristic to recover simple models,
and this connection is transparent in the definition of the atomic norm.

3 Recovery from Generic Measurements 

We consider the question of using the convex program (5) to recover "simple" models formed
according to (1) from a generic measurement operator or map $\Phi : \mathbb{R}^p \to \mathbb{R}^n$.
Specifically, we wish to compute estimates on the number of measurements $n$ so that we have exact
recovery using (5) for most operators consisting of $n$ measurements. That is, the measure of
$n$-measurement operators for which recovery fails using (5) must be exponentially small. In order
to conduct such an analysis we study random Gaussian maps whose entries are independent and
identically distributed Gaussians. These measurement operators have a nullspace that is uniformly
distributed among the set of all $(p - n)$-dimensional subspaces in $\mathbb{R}^p$. In particular we
analyze when such operators satisfy the conditions of Proposition 2.1 and Proposition 2.2 for
exact recovery.

3.1 Recovery Conditions based on Gaussian Width



Proposition 2.1 requires that the nullspace of the measurement operator $\Phi$ must miss the tangent
cone $T_{\mathcal{A}}(x^\star)$. Gordon [37] gave a solution to the problem of characterizing the probability
that a random subspace (of some fixed dimension) distributed uniformly misses a cone. We begin by
defining the Gaussian width of a set, which plays a key role in Gordon's analysis.

Definition 3.1. The Gaussian width of a set $S \subset \mathbb{R}^p$ is defined as:

$$w(S) := \mathbb{E}_g\Big[\sup_{z \in S} g^T z\Big],$$

where $g \sim \mathcal{N}(0, I)$ is a vector of independent zero-mean unit-variance Gaussians.

Gordon characterized the likelihood that a random subspace misses a cone $C$ purely in terms of
the dimension of the subspace and the Gaussian width $w(C \cap \mathbb{S}^{p-1})$, where
$\mathbb{S}^{p-1} \subset \mathbb{R}^p$ is the unit sphere. Before describing Gordon's result formally, we
introduce some notation. Let $\lambda_k$ denote the expected length of a $k$-dimensional Gaussian
random vector. By elementary integration, we have that
$\lambda_k = \sqrt{2}\,\Gamma(\frac{k+1}{2})/\Gamma(\frac{k}{2})$. Further by induction one can show that
$\lambda_k$ is tightly bounded as

$$\frac{k}{\sqrt{k+1}} \le \lambda_k \le \sqrt{k}.$$

The main idea underlying Gordon's theorem is a bound on the minimum gain of an operator
restricted to a set. Specifically, recall that $\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^\star) = \{0\}$ is the
condition required for recovery by Proposition 2.1. Thus if we have that the minimum gain of
$\Phi$ restricted to vectors in the set $T_{\mathcal{A}}(x^\star) \cap \mathbb{S}^{p-1}$ is bounded away from zero,
then it is clear that $\mathrm{null}(\Phi) \cap T_{\mathcal{A}}(x^\star) = \{0\}$. We refer to such minimum gains
restricted to a subset of the sphere as restricted minimum singular values, and the following
theorem of Gordon gives a bound on these quantities:

Theorem 3.2 (Cor. 1.2, [37]). Let $\Omega$ be a closed subset of $\mathbb{S}^{p-1}$. Let
$\Phi : \mathbb{R}^p \to \mathbb{R}^n$ be a random map with i.i.d. zero-mean Gaussian entries having variance
one. Then

$$\mathbb{E}\Big[\min_{z \in \Omega} \|\Phi z\|_2\Big] \ge \lambda_n - w(\Omega). \tag{12}$$

Theorem 3.2 allows us to characterize exact recovery in the noise-free case using the convex
program (5), and robust recovery in the noisy case using the convex program (7). Specifically, we
consider the number of measurements required for exact or robust recovery when the measurement
map $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ consists of i.i.d. zero-mean Gaussian entries having variance $1/n$.
The normalization of the variance ensures that the columns of $\Phi$ are approximately unit-norm,
and is necessary in order to properly define a signal-to-noise ratio. The following corollary
summarizes the main results of interest in our setting:

Corollary 3.3. Let $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ be a random map with i.i.d. zero-mean Gaussian entries
having variance $1/n$. Further let $\Omega = T_{\mathcal{A}}(x^\star) \cap \mathbb{S}^{p-1}$ denote the spherical part
of the tangent cone $T_{\mathcal{A}}(x^\star)$.

1. Suppose that we have measurements $y = \Phi x^\star$ and solve the convex program (5). Then
$x^\star$ is the unique optimum of (5) with probability at least
$1 - \exp\left(-\frac{1}{2}[\lambda_n - w(\Omega)]^2\right)$ provided

$$n \ge w(\Omega)^2 + 1.$$

2. Suppose that we have noisy measurements $y = \Phi x^\star + \omega$, with the noise $\omega$ bounded
as $\|\omega\| \le \delta$, and that we solve the convex program (7). Letting $\hat{x}$ denote the optimal
solution of (7), we have that $\|x^\star - \hat{x}\| \le \frac{2\delta}{\epsilon}$ with probability at least
$1 - \exp\left(-\frac{1}{2}[\lambda_n - w(\Omega) - \sqrt{n}\,\epsilon]^2\right)$ provided

$$n \ge \frac{w(\Omega)^2 + 3/2}{(1 - \epsilon)^2}.$$
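As a rough numerical companion to Corollary 3.3, the following Python sketch evaluates $\lambda_k$ from its closed form and the two measurement conditions of the corollary. The function names are illustrative choices, not notation from the paper, and the log-gamma formulation is an implementation assumption used only to avoid overflow for large $k$.

```python
import math


def expected_gaussian_norm(k: int) -> float:
    """lambda_k = sqrt(2) * Gamma((k + 1) / 2) / Gamma(k / 2), computed via log-gamma."""
    return math.sqrt(2.0) * math.exp(math.lgamma((k + 1) / 2.0) - math.lgamma(k / 2.0))


def exact_recovery_failure_bound(n: int, width: float) -> float:
    """Bound exp(-(lambda_n - w)^2 / 2) on the failure probability in part 1.

    The bound is only claimed when n >= width**2 + 1.
    """
    if n < width ** 2 + 1:
        raise ValueError("part 1 of the corollary requires n >= w^2 + 1")
    gap = expected_gaussian_norm(n) - width
    return math.exp(-0.5 * gap ** 2)


def robust_measurement_threshold(width: float, eps: float) -> int:
    """Smallest integer n with n >= (w^2 + 3/2) / (1 - eps)^2, as in part 2."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return math.ceil((width ** 2 + 1.5) / (1.0 - eps) ** 2)


if __name__ == "__main__":
    for k in (1, 10, 100):
        lam = expected_gaussian_norm(k)
        # The sandwich k / sqrt(k + 1) <= lambda_k <= sqrt(k) should hold for every k.
        print(k, k / math.sqrt(k + 1), lam, math.sqrt(k))
    print(exact_recovery_failure_bound(n=200, width=10.0))
    print(robust_measurement_threshold(width=10.0, eps=0.5))
```

The width passed in is a placeholder value; obtaining $w(\Omega)$ for a concrete tangent cone is the harder part and is the subject of the following sections.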



Proof. The two results are simple consequences of Theorem 3.2 and a concentration of measure
argument. Recall that for a function $f : \mathbb{R}^d \to \mathbb{R}$ with Lipschitz constant $L$ and a random
Gaussian vector $g \in \mathbb{R}^d$ with mean zero and identity covariance,

$$\mathbb{P}\big[f(g) \ge \mathbb{E}[f] - t\big] \ge 1 - \exp\left(-\frac{t^2}{2L^2}\right) \tag{13}$$

(see, for example, [48, 59]). For any set $\Omega \subset \mathbb{S}^{p-1}$, the function

$$\Phi \mapsto \min_{z \in \Omega} \|\Phi z\|_2$$

is Lipschitz with respect to the Frobenius norm with constant 1. Thus, applying Theorem 3.2 and
(13), we find that

$$\mathbb{P}\Big[\min_{z \in \Omega} \|\Phi z\|_2 \ge \epsilon\Big] \ge 1 - \exp\left(-\frac{1}{2}\big(\lambda_n - w(\Omega) - \sqrt{n}\,\epsilon\big)^2\right) \tag{14}$$

provided that $\lambda_n - w(\Omega) - \sqrt{n}\,\epsilon \ge 0$.

The first part now follows by setting $\epsilon = 0$ in (14). The concentration inequality is valid
provided that $\lambda_n \ge w(\Omega)$. To verify this, note

$$\lambda_n \ge \frac{n}{\sqrt{n+1}} \ge \sqrt{\frac{w(\Omega)^2 + 1}{1 + 1/n}} \ge \sqrt{\frac{w(\Omega)^2 + w(\Omega)^2/n}{1 + 1/n}} = w(\Omega).$$

Here, both inequalities use the fact that $n \ge w(\Omega)^2 + 1$.

For the second part, we have from (14) that

$$\|\Phi z\| \ge \epsilon \|z\|$$

for all $z \in T_{\mathcal{A}}(x^\star)$ with high probability if $\lambda_n \ge w(\Omega) + \sqrt{n}\,\epsilon$. In this
case, we can apply Proposition 2.2 to conclude that $\|\hat{x} - x^\star\| \le \frac{2\delta}{\epsilon}$.
Verifying that concentration of measure can be applied is more or less the same as in the proof
of Part 1. First, note that under the assumptions of the corollary

$$w(\Omega)^2 + 1 \le n(1 - \epsilon)^2 - \frac{1}{2} \le n(1 - \epsilon)^2 - 2\epsilon(1 - \epsilon) + \frac{\epsilon^2}{n} = \left(\sqrt{n}(1 - \epsilon) - \frac{\epsilon}{\sqrt{n}}\right)^2$$

as $\epsilon(1 - \epsilon) \le 1/4$ for $\epsilon \in (0, 1)$. Using this fact, we then have

$$\lambda_n - \sqrt{n}\,\epsilon \ge \frac{n - (n+1)\epsilon}{\sqrt{n+1}} \ge \sqrt{\frac{w(\Omega)^2 + 1}{1 + 1/n}} \ge w(\Omega)$$

as desired. □

Gordon's theorem thus provides a simple characterization of the number of measurements required
for reconstruction with the atomic norm. Indeed the Gaussian width of
$\Omega = T_{\mathcal{A}}(x^\star) \cap \mathbb{S}^{p-1}$ is the only quantity that we need to compute in order to
obtain bounds for both exact and robust recovery. Unfortunately it is in general not easy to
compute Gaussian widths. Rudelson and Vershynin [63] have worked out Gaussian widths for the
special case of tangent cones at sparse vectors on the boundary of the $\ell_1$ ball, and derived
results for sparse vector recovery using $\ell_1$ minimization that improve upon previous results.
In the next section we give various well-known properties of the Gaussian width that are useful
in computations. In Section 3.3 we discuss a new approach to width computations that gives
near-optimal recovery bounds in a variety of settings.
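When no closed form is available, the definition of the Gaussian width can at least be explored by simulation. The sketch below is a plain Monte Carlo estimate of $w(S)$ for a finite set $S$, here the signed standard basis vectors, i.e., the atoms of the $\ell_1$ norm. The helper names, sample count, and seed are assumptions made for illustration. Note that this estimates the width of the atomic set itself, not the width of a tangent cone as required in Corollary 3.3.

```python
import math
import random


def signed_basis_vectors(p):
    """Return the 2p points +e_i and -e_i in R^p."""
    points = []
    for i in range(p):
        for sign in (1.0, -1.0):
            v = [0.0] * p
            v[i] = sign
            points.append(v)
    return points


def gaussian_width_finite(points, num_samples=2000, seed=0):
    """Monte Carlo estimate of E[max_{z in points} <g, z>] for g ~ N(0, I)."""
    rng = random.Random(seed)
    p = len(points[0])
    total = 0.0
    for _ in range(num_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        total += max(sum(gi * zi for gi, zi in zip(g, z)) for z in points)
    return total / num_samples


if __name__ == "__main__":
    p = 20
    estimate = gaussian_width_finite(signed_basis_vectors(p))
    # For 2p unit vectors, the standard maximal inequality gives sqrt(2 log(2p)) as an upper bound.
    print(estimate, math.sqrt(2.0 * math.log(2 * p)))
```

For this set the supremum equals $\max_i |g_i|$, so the estimate can be compared against the standard bound $\sqrt{2\log(2p)}$ on the expected maximum of $2p$ standard Gaussians.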

3.2 Properties of Gaussian Width 

The Gaussian width has deep connections to convex geometry. Since the length and direction of a
Gaussian random vector are independent, one can verify that for $S \subset \mathbb{R}^p$

$$w(S) = \frac{\lambda_p}{2} \int_{\mathbb{S}^{p-1}} \Big(\max_{z \in S} u^T z - \min_{z \in S} u^T z\Big)\, du = \frac{\lambda_p}{2}\, b(S),$$

where the integral is with respect to Haar measure on $\mathbb{S}^{p-1}$ and $b(S)$ is known as the mean
width of $S$. The mean width measures the average length of $S$ along unit directions in
$\mathbb{R}^p$ and is one of the fundamental intrinsic volumes of a body studied in combinatorial
geometry [44]. Any continuous valuation that is invariant under rigid motions and homogeneous of
degree 1 is a multiple of the mean width and hence a multiple of the Gaussian width. We can use
this connection with convex geometry to underscore several properties of the Gaussian width that
are useful for computation.

The Gaussian width of a body is invariant under translations and unitary transformations.
Moreover, it is homogeneous in the sense that $w(tK) = t\,w(K)$ for $t > 0$. The width is also
monotonic. If $S_1 \subseteq S_2 \subseteq \mathbb{R}^p$, then it is clear from the definition of the Gaussian
width that

$$w(S_1) \le w(S_2).$$

Less obviously, the width is modular in the sense that if $S_1$ and $S_2$ are convex bodies with
$S_1 \cup S_2$ convex, we also have

$$w(S_1 \cup S_2) + w(S_1 \cap S_2) = w(S_1) + w(S_2).$$

This equality follows from the fact that $w$ is a valuation [4]. Also note that if we have a set
$S \subseteq \mathbb{R}^p$, then the Gaussian width of $S$ is equal to the Gaussian width of the convex hull
of $S$:

$$w(S) = w(\mathrm{conv}(S)).$$

This result follows from the basic fact in convex analysis that the maximum of a convex function
over a convex set is achieved at an extreme point of the convex set.

If $V \subset \mathbb{R}^p$ is a subspace in $\mathbb{R}^p$, then we have that

$$w(V \cap \mathbb{S}^{p-1}) = \lambda_{\dim(V)} \le \sqrt{\dim(V)},$$

which follows from standard results on random Gaussians. This result also agrees with the
intuition that a random Gaussian map $\Phi$ misses a $k$-dimensional subspace with high probability
as long as the number of measurements satisfies $n \ge k + 1$, i.e.,
$\dim(\mathrm{null}(\Phi)) \le p - k - 1$. Finally, if a cone $S \subset \mathbb{R}^p$ is such that
$S = S_1 \oplus S_2$, where $S_1 \subset \mathbb{R}^p$ is a $k$-dimensional cone, $S_2 \subset \mathbb{R}^p$ is a
$(p - k)$-dimensional cone that is orthogonal to $S_1$, and $\oplus$ denotes the direct sum operation,
then the width can be decomposed as follows:

$$w(S \cap \mathbb{S}^{p-1})^2 \le w(S_1 \cap \mathbb{S}^{p-1})^2 + w(S_2 \cap \mathbb{S}^{p-1})^2.$$

These observations are useful in a variety of situations. For example a width computation that
frequently arises is one in which $S = S_1 \oplus S_2$ as described above, with $S_1$ being a
$k$-dimensional subspace. It follows that the width of $S \cap \mathbb{S}^{p-1}$ is bounded as

$$w(S \cap \mathbb{S}^{p-1})^2 \le k + w(S_2 \cap \mathbb{S}^{p-1})^2. \tag{15}$$



Another tool for computing Gaussian widths is based on Dudley's inequality [31, 48], which
bounds the width of a set in terms of the covering number of the set at all scales.

Definition 3.4. Let $S$ be an arbitrary compact subset of $\mathbb{R}^p$. The covering number of $S$ in
the Euclidean norm at resolution $\epsilon$ is the smallest number, $\mathcal{N}(S, \epsilon)$, such that
$\mathcal{N}(S, \epsilon)$ Euclidean balls of radius $\epsilon$ cover $S$.

Theorem 3.5 (Dudley's Inequality). Let $S$ be an arbitrary compact subset of $\mathbb{R}^p$, and let
$g$ be a random vector with i.i.d. zero-mean, unit-variance Gaussian entries. Then

$$w(S) \le 24 \int_0^\infty \sqrt{\log \mathcal{N}(S, \epsilon)}\; d\epsilon. \tag{16}$$

We note here that a weak converse to Dudley's inequality can be obtained via Sudakov's
minoration [48] by using the covering number for just a single scale. Specifically, we have the
following lower bound on the Gaussian width of a compact subset $S \subset \mathbb{R}^p$ for any
$\epsilon > 0$:

$$w(S) \ge c\,\epsilon \sqrt{\log \mathcal{N}(S, \epsilon)}.$$

Here $c > 0$ is some universal constant.

Although Dudley's inequality can be applied quite generally, estimating covering numbers is
difficult in most instances. There are a few simple characterizations available for spheres and
Sobolev spaces, and some tractable arguments based on Maurey's empirical method [48]. However it
is not evident how to compute these numbers for general convex cones. Also, in order to apply
Dudley's inequality we need to estimate the covering number at all scales. Further, Dudley's
inequality can be quite loose in its estimates, and it often introduces extraneous
polylogarithmic factors. In the next section we describe a new mechanism for estimating Gaussian
widths, which provides near-optimal guarantees for recovery of sparse vectors and low-rank
matrices, as well as for several of the recovery problems discussed in Section 3.4.

3.3 New Results on Gaussian Width

We now present a framework for computing Gaussian widths by bounding the Gaussian width of
a cone via the distance to the dual cone. To be fully general let $C$ be a non-empty convex cone
in $\mathbb{R}^p$, and let $C^*$ denote the polar of $C$. We can then upper bound the Gaussian width of
any cone $C$ in terms of the polar cone $C^*$:

Proposition 3.6. Let $C$ be any non-empty convex cone in $\mathbb{R}^p$, and let
$g \sim \mathcal{N}(0, I)$ be a random Gaussian vector. Then we have the following bound:

$$w(C \cap \mathbb{S}^{p-1}) \le \mathbb{E}_g[\mathrm{dist}(g, C^*)],$$

where dist here denotes the Euclidean distance between a point and a set.

The proof is given in Appendix A, and it follows from an appeal to convex duality.
Proposition 3.6 is more or less a restatement of the fact that the support function of a convex
cone is equal to the distance to its polar cone. As it is the square of the Gaussian width that
is of interest to us (see Corollary 3.3), it is often useful to apply Jensen's inequality to
make the following approximation:

$$\mathbb{E}_g[\mathrm{dist}(g, C^*)]^2 \le \mathbb{E}_g[\mathrm{dist}(g, C^*)^2]. \tag{17}$$

The inspiration for our characterization in Proposition 3.6 of the width of a cone in terms of
the expected distance to its dual came from the work of Stojnic [67], who used linear programming
duality to construct Gaussian-width-based estimates for analyzing recovery in sparse reconstruction
problems. Specifically, Stojnic's relatively simple approach recovered well-known phase transitions
in sparse signal recovery [28], and also generalized to block sparse signals and other forms of
structured sparsity.

This new dual characterization yields a number of useful bounds on the Gaussian width, which 
we describe here. In the following section we use these bounds to derive new recovery results. The 
first result is a bound on the Gaussian width of a cone in terms of the Gaussian width of its polar. 

Lemma 3.7. Let $C \subseteq \mathbb{R}^p$ be a non-empty closed, convex cone. Then we have that

$$w(C \cap \mathbb{S}^{p-1})^2 + w(C^* \cap \mathbb{S}^{p-1})^2 \le p.$$

Proof. Combining Proposition 3.6 and (17), we have that

$$w(C \cap \mathbb{S}^{p-1})^2 \le \mathbb{E}_g[\mathrm{dist}(g, C^*)^2],$$

where as before $g \sim \mathcal{N}(0, I)$. For any $z \in \mathbb{R}^p$ we let
$\Pi_C(z) = \arg\inf_{u \in C} \|z - u\|$ denote the projection of $z$ onto $C$. From standard results
in convex analysis [63], we note that one can decompose any $z \in \mathbb{R}^p$ into orthogonal
components as follows:

$$z = \Pi_C(z) + \Pi_{C^*}(z), \qquad \langle \Pi_C(z), \Pi_{C^*}(z) \rangle = 0.$$

Therefore we have the following sequence of bounds:

$$\begin{aligned}
w(C \cap \mathbb{S}^{p-1})^2 &\le \mathbb{E}_g[\mathrm{dist}(g, C^*)^2] \\
&= \mathbb{E}_g[\|\Pi_C(g)\|^2] \\
&= \mathbb{E}_g[\|g\|^2 - \|\Pi_{C^*}(g)\|^2] \\
&= p - \mathbb{E}_g[\|\Pi_{C^*}(g)\|^2] \\
&= p - \mathbb{E}_g[\mathrm{dist}(g, C)^2] \\
&\le p - w(C^* \cap \mathbb{S}^{p-1})^2.
\end{aligned}$$

□
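The orthogonal decomposition used in this proof can be checked numerically on a cone whose projections have a closed form. The sketch below takes $C$ to be the nonnegative orthant, whose polar is the nonpositive orthant; this choice of cone, the function names, and the sampling parameters are assumptions made purely for illustration. It verifies the decomposition on one vector and estimates $\mathbb{E}_g[\mathrm{dist}(g, C^*)^2]$, which for this cone equals $p/2$.

```python
import random


def project_orthant(z):
    """Euclidean projection onto the nonnegative orthant C."""
    return [max(x, 0.0) for x in z]


def project_polar(z):
    """Euclidean projection onto the polar cone C* (the nonpositive orthant)."""
    return [min(x, 0.0) for x in z]


def mean_squared_distance_to_polar(p, num_samples=4000, seed=0):
    """Monte Carlo estimate of E[dist(g, C*)^2] = E[||Pi_C(g)||^2] for g ~ N(0, I_p)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        total += sum(x * x for x in project_orthant(g))
    return total / num_samples


if __name__ == "__main__":
    z = [1.5, -2.0, 0.0, 3.0]
    on_cone, on_polar = project_orthant(z), project_polar(z)
    print([a + b for a, b in zip(on_cone, on_polar)])    # the two parts sum back to z
    print(sum(a * b for a, b in zip(on_cone, on_polar)))  # the two parts are orthogonal
    print(mean_squared_distance_to_polar(40))             # close to p / 2 = 20
```

Since each coordinate of $g$ is positive with probability one half, the estimate should concentrate near $p/2$, which by the chain of bounds above is an upper bound on $w(C \cap \mathbb{S}^{p-1})^2$ for the orthant.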



In many recovery problems one is interested in computing the width of a self-dual cone. For
such cones the following corollary to Lemma 3.7 gives a simple solution:

Corollary 3.8. Let $C \subset \mathbb{R}^p$ be a self-dual cone, i.e., $C = -C^*$. Then we have that

$$w(C \cap \mathbb{S}^{p-1})^2 \le \frac{p}{2}.$$

Proof. The proof follows directly from Lemma 3.7 as
$w(C \cap \mathbb{S}^{p-1})^2 = w(C^* \cap \mathbb{S}^{p-1})^2$. □

Our next bound for the width of a cone $C$ is based on the volume of its polar
$C^* \cap \mathbb{S}^{p-1}$. The volume of a measurable subset of the sphere is the fraction of the sphere
$\mathbb{S}^{p-1}$ covered by the subset. Thus it is a quantity between zero and one.

Theorem 3.9 (Gaussian width from volume of the polar). Let $C \subset \mathbb{R}^p$ be any closed,
convex, solid cone, and suppose that its polar $C^*$ is such that $C^* \cap \mathbb{S}^{p-1}$ has a
volume of $\Theta \in [0, 1]$. Then for $p \ge 9$ we have that

$$w(C \cap \mathbb{S}^{p-1}) \le 3\sqrt{\log\left(\frac{4}{\Theta}\right)}.$$

The proof of this theorem is given in Appendix B. The main property that we appeal to in
the proof is Gaussian isoperimetry. In particular there is a formal sense in which a spherical
cap³ is the "extremal case" among all subsets of the sphere with a given volume $\Theta$. Other than
this observation the proof mainly involves a sequence of integral calculations.

Note that if we are given a specification of a cone $C \subset \mathbb{R}^p$ in terms of a membership
oracle, it is possible to efficiently obtain good numerical estimates of the volume of
$C \cap \mathbb{S}^{p-1}$ [32]. Moreover, simple symmetry arguments often give relatively accurate
estimates of these volumes. Such estimates can then be plugged into Theorem 3.9 to yield bounds
on the width.

3.4 New Recovery Bounds 

We use the bounds derived in the last section to obtain new recovery results. First, using the
dual characterization of the Gaussian width in Proposition 3.6, we are able to obtain sharp bounds
on the number of measurements required for recovering sparse vectors and low-rank matrices from
random Gaussian measurements using convex optimization (i.e., $\ell_1$-norm and nuclear norm
minimization).

Proposition 3.10. Let $x^\star \in \mathbb{R}^p$ be an $s$-sparse vector. Letting $\mathcal{A}$ denote the set of
unit-Euclidean-norm one-sparse vectors, we have that

$$w(T_{\mathcal{A}}(x^\star) \cap \mathbb{S}^{p-1})^2 \le 2s\log\left(\frac{p}{s}\right) + \frac{5}{4}s.$$

Thus $2s\log(p/s) + \frac{5}{4}s + 1$ random Gaussian measurements suffice to recover $x^\star$ via
$\ell_1$-norm minimization with high probability.

Proposition 3.11. Let $x^\star$ be an $m_1 \times m_2$ rank-$r$ matrix with $m_1 \le m_2$. Letting
$\mathcal{A}$ denote the set of unit-Euclidean-norm rank-one matrices, we have that

$$w(T_{\mathcal{A}}(x^\star) \cap \mathbb{S}^{m_1 m_2 - 1})^2 \le 3r(m_1 + m_2 - r).$$

Thus $3r(m_1 + m_2 - r) + 1$ random Gaussian measurements suffice to recover $x^\star$ via nuclear norm
minimization with high probability.

The proofs of these propositions are given in Appendix C. The number of measurements required
by these bounds is on the same order as previously known results. In the case of sparse vectors,
previous results getting $2s\log(p/s)$ were asymptotic [29]. Our bounds, in contrast, hold with high
probability in finite dimensions. In the case of low-rank matrices, our bound provides considerably
sharper constants than those previously derived (as in, for example, [15]). We also note that we
have robust recovery at these thresholds. Further these results do not require explicit recourse to
any type of restricted isometry property [18], and the proofs are simple and based on elementary
integrals.

³ A spherical cap is a subset of the sphere obtained by intersecting the sphere $\mathbb{S}^{p-1}$ with
a halfspace.
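To get a feel for the sizes involved, the measurement counts stated in Propositions 3.10 and 3.11 can be evaluated directly. The sketch below is a simple calculator; the function names and the example dimensions are hypothetical, and the sparse-vector count is rounded up to an integer as an added convention.

```python
import math


def sparse_measurements(p: int, s: int) -> int:
    """Measurement count 2 s log(p / s) + (5 / 4) s + 1, rounded up to an integer."""
    if not 0 < s <= p:
        raise ValueError("need 0 < s <= p")
    return math.ceil(2.0 * s * math.log(p / s) + 1.25 * s + 1.0)


def low_rank_measurements(m1: int, m2: int, r: int) -> int:
    """Measurement count 3 r (m1 + m2 - r) + 1 for an m1 x m2 rank-r matrix."""
    if not 0 < r <= min(m1, m2):
        raise ValueError("need 0 < r <= min(m1, m2)")
    return 3 * r * (m1 + m2 - r) + 1


if __name__ == "__main__":
    # A 10-sparse vector in dimension 1000: compare against the ambient dimension p.
    print(sparse_measurements(p=1000, s=10), "versus ambient dimension", 1000)
    # A rank-2 matrix of size 50 x 50: compare against the ambient dimension m1 * m2.
    print(low_rank_measurements(m1=50, m2=50, r=2), "versus ambient dimension", 50 * 50)
```

In both examples the count is a small fraction of the ambient dimension, which is the regime in which these bounds are informative.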

Next we obtain a set of recovery results by appealing to Corollary 3.8 on the width of a self-dual
cone. These examples correspond to the recovery of individual atoms (i.e., the extreme points of
the set $\mathrm{conv}(\mathcal{A})$), although the same machinery is applicable in principle to estimate the
number of measurements required to recover models formed as sums of a few atoms (i.e., points
lying on low-dimensional faces of $\mathrm{conv}(\mathcal{A})$). We first obtain a well-known result on the
number of measurements required for recovering sign-vectors via $\ell_\infty$-norm minimization.

Proposition 3.12. Let $\mathcal{A} = \{-1, +1\}^p$ be the set of sign-vectors in $\mathbb{R}^p$. Suppose
$x^\star \in \mathbb{R}^p$ is a vector formed as a convex combination of $k$ sign-vectors in $\mathcal{A}$ such
that $x^\star$ lies on a $k$-face of the $\ell_\infty$-norm unit ball. Then we have that

$$w(T_{\mathcal{A}}(x^\star) \cap \mathbb{S}^{p-1})^2 \le \frac{p + k}{2}.$$

Thus $\frac{p+k}{2}$ random Gaussian measurements suffice to recover $x^\star$ via $\ell_\infty$-norm
minimization with high probability.

Proof. The tangent cone at $x^\star$ with respect to the $\ell_\infty$-norm ball is the direct sum of a
$k$-dimensional subspace and a (rotated) $(p - k)$-dimensional nonnegative orthant. As the orthant
is self-dual, we obtain the required bound by combining Corollary 3.8 and (15). □
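The structure used in this proof, a $k$-dimensional subspace plus a $(p-k)$-dimensional orthant, is simple enough to simulate. The sketch below works in coordinates where the cone is $\mathbb{R}^k \times \mathbb{R}_+^{p-k}$ (an assumption that ignores the rotation mentioned in the proof, which does not change the width), uses the closed-form supremum of $g^T z$ over unit vectors in that cone, and compares the squared Monte Carlo width with $(p+k)/2$. Function names and sampling parameters are illustrative.

```python
import math
import random


def support_on_sphere(g, k):
    """Supremum of <g, z> over unit vectors z in R^k x R_+^(p-k)."""
    free = sum(x * x for x in g[:k])                 # unconstrained coordinates
    pos = sum(x * x for x in g[k:] if x > 0.0)       # only positive parts help in the orthant
    if free + pos > 0.0:
        return math.sqrt(free + pos)
    # Degenerate case: nothing positive to align with.
    return 0.0 if k > 0 else max(g)


def orthant_plus_subspace_width(p, k, num_samples=4000, seed=0):
    """Monte Carlo estimate of the Gaussian width of the cone intersected with the sphere."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        total += support_on_sphere(g, k)
    return total / num_samples


if __name__ == "__main__":
    p, k = 30, 5
    width = orthant_plus_subspace_width(p, k)
    print(width ** 2, "should not exceed roughly", (p + k) / 2)
```

The squared estimate falls slightly below $(p+k)/2$ because of the Jensen step (17); the gap is the variance of the supremum.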

This result agrees with previously computed bounds in [51, 30], which relied on a more
complicated combinatorial argument. Next we compute the number of measurements required to recover
orthogonal matrices via spectral-norm minimization (see Section 2.2). Let $O(m)$ denote the group
of $m \times m$ orthogonal matrices, viewed as a subgroup of the set of nonsingular matrices in
$\mathbb{R}^{m \times m}$.

Proposition 3.13. Let $x^\star \in \mathbb{R}^{m \times m}$ be an orthogonal matrix, and let $\mathcal{A}$ be the set
of all orthogonal matrices. Then we have that

$$w(T_{\mathcal{A}}(x^\star) \cap \mathbb{S}^{m^2 - 1})^2 \le \frac{3m^2 - m}{4}.$$

Thus $\frac{3m^2 - m}{4}$ random Gaussian measurements suffice to recover $x^\star$ via spectral-norm
minimization with high probability.

Proof. Due to the symmetry of the orthogonal group, it suffices to consider the tangent cone at the
identity matrix $I$ with respect to the spectral norm ball. Recall that the spectral norm ball is
the convex hull of the orthogonal matrices. Therefore the tangent space at the identity matrix
with respect to the orthogonal group $O(m)$ is a subset of the tangent cone $T_{\mathcal{A}}(I)$. It is
well-known that this tangent space is the Lie algebra of all $m \times m$ skew-symmetric matrices.
Thus we only need to compute the component $S$ of $T_{\mathcal{A}}(I)$ that lies in the subspace of
symmetric matrices:

$$\begin{aligned}
S &= \mathrm{cone}\{M - I : \|M\|_{\mathcal{A}} \le 1,\ M \text{ symmetric}\} \\
&= \mathrm{cone}\{UDU^T - UU^T : \|D\|_{\mathcal{A}} \le 1,\ D \text{ diagonal},\ U \in O(m)\} \\
&= \mathrm{cone}\{U(D - I)U^T : \|D\|_{\mathcal{A}} \le 1,\ D \text{ diagonal},\ U \in O(m)\} \\
&= -\mathrm{PSD}_m.
\end{aligned}$$

Here $\mathrm{PSD}_m$ denotes the set of $m \times m$ symmetric positive-semidefinite matrices. As this
cone is self-dual, we can apply Corollary 3.8 in conjunction with the observations in Section 3.2
to conclude that

$$w(T_{\mathcal{A}}(I) \cap \mathbb{S}^{m^2 - 1})^2 \le \binom{m}{2} + \frac{1}{2}\binom{m+1}{2} = \frac{3m^2 - m}{4}.$$

□

We note that the number of degrees of freedom in an $m \times m$ orthogonal matrix (i.e., the
dimension of the manifold of orthogonal matrices) is $\frac{m(m-1)}{2}$. Proposition 3.12 and
Proposition 3.13 point to the importance of obtaining recovery bounds with sharp constants. Larger
constants in either result would imply that the number of measurements required exceeds the
ambient dimension of the underlying $x^\star$. In these and many other cases of interest Gaussian
width arguments not only give order-optimal recovery results, but also provide precise constants
that result in sharp recovery thresholds.
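The remark about sharp constants can be made concrete with exact arithmetic. The sketch below compares, for orthogonal matrices, the degrees of freedom $m(m-1)/2$, the measurement bound $(3m^2 - m)/4$, and the ambient dimension $m^2$; the function name and the factor of two used to illustrate a "larger constant" are assumptions chosen for the example.

```python
from fractions import Fraction


def orthogonal_matrix_counts(m: int):
    """Return (degrees of freedom, measurement bound, ambient dimension) for O(m)."""
    degrees_of_freedom = Fraction(m * (m - 1), 2)
    measurement_bound = Fraction(3 * m * m - m, 4)
    ambient_dimension = Fraction(m * m)
    return degrees_of_freedom, measurement_bound, ambient_dimension


if __name__ == "__main__":
    for m in (2, 10, 100):
        dof, bound, ambient = orthogonal_matrix_counts(m)
        # The bound sits between the degrees of freedom and the ambient dimension,
        # but doubling it would already exceed the ambient dimension.
        print(m, dof, bound, ambient, 2 * bound > ambient)
```

For every $m \ge 2$ the bound lies strictly between the two reference quantities, while a constant twice as large would give $(3m^2 - m)/2 > m^2$ and hence a vacuous statement.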

Finally we give a third set of recovery results that appeal to the Gaussian width bound of
Theorem 3.9. The following measurement bound applies to cases when $\mathrm{conv}(\mathcal{A})$ is a symmetric
polytope (roughly speaking, all the vertices are "equivalent"), and is a simple corollary of
Theorem 3.9.

Corollary 3.14. Suppose that the set $\mathcal{A}$ is a finite collection of $m$ points, with the convex
hull $\mathrm{conv}(\mathcal{A})$ being a vertex-transitive polytope [73] whose vertices are the points in
$\mathcal{A}$. Using the convex program (5) we have that $9\log(m)$ random Gaussian measurements suffice,
with high probability, for exact recovery of a point in $\mathcal{A}$, i.e., a vertex of
$\mathrm{conv}(\mathcal{A})$.

Proof. We recall the basic fact from convex analysis that the normal cones at the vertices of a
convex polytope in $\mathbb{R}^p$ provide a partitioning of $\mathbb{R}^p$. As $\mathrm{conv}(\mathcal{A})$ is a
vertex-transitive polytope, the normal cone at a vertex covers a $\frac{1}{m}$ fraction of
$\mathbb{R}^p$. Applying Theorem 3.9 we have the desired result. □

Clearly we require the number of vertices to be bounded as $m \le \exp\{p/9\}$, so that the
estimate of the number of measurements is not vacuously true. This result has useful consequences
in settings in which $\mathrm{conv}(\mathcal{A})$ is a combinatorial polytope, as such polytopes are often
vertex-transitive. We have the following example on the number of measurements required to recover
permutation matrices⁴:

Proposition 3.15. Let $x^\star \in \mathbb{R}^{m \times m}$ be a permutation matrix, and let $\mathcal{A}$ be the set
of all $m \times m$ permutation matrices. Then $9m\log(m)$ random Gaussian measurements suffice, with
high probability, to recover $x^\star$ by solving the optimization problem (5), which minimizes the
norm induced by the Birkhoff polytope of doubly stochastic matrices.

Proof. This result follows from Corollary 3.14 by noting that there are $m!$ permutation matrices
of size $m \times m$. □
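The step from Corollary 3.14 to Proposition 3.15 replaces $\log(m!)$ by the simpler upper bound $m\log(m)$. The sketch below evaluates both quantities alongside the ambient dimension $m^2$, so one can see for which matrix sizes the stated count is informative. The function name is hypothetical, and `math.lgamma` is used as an assumed convenience for computing $\log(m!)$ without forming the factorial.

```python
import math


def permutation_measurement_bounds(m: int):
    """Return (9 log(m!), 9 m log(m), m^2) for m x m permutation matrices."""
    from_vertex_count = 9.0 * math.lgamma(m + 1)   # lgamma(m + 1) = log(m!)
    simplified = 9.0 * m * math.log(m)
    ambient_dimension = m * m
    return from_vertex_count, simplified, ambient_dimension


if __name__ == "__main__":
    for m in (10, 20, 31, 50, 100):
        exact, simplified, ambient = permutation_measurement_bounds(m)
        # The count is only informative when it falls below the ambient dimension.
        print(m, round(exact, 1), round(simplified, 1), ambient, simplified < ambient)
```

Under these formulas the simplified count $9m\log(m)$ drops below $m^2$ only once $m$ exceeds $9\log(m)$, which is the permutation-matrix instance of the requirement that the number of vertices not be too large relative to the dimension.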



4 Representability and Algebraic Geometry of Atomic Norms 

All of our discussion thus far has focussed on arbitrary atomic sets $\mathcal{A}$. As seen in Section 2
the geometry of the convex hull $\mathrm{conv}(\mathcal{A})$ completely determines conditions under which exact
recovery is possible using the convex program (5). In this section we address the question of
computing atomic norms for general sets of atoms. These issues are critical in order to be able to
solve the convex optimization problem (5). Although the convex hull $\mathrm{conv}(\mathcal{A})$ is always a
mathematically well-defined object, testing membership in this set is in general undecidable (for
example, if $\mathcal{A}$ is a fractal). Further, even if these convex hulls are computable they may not
admit efficient representations. For example if $\mathcal{A}$ is the set of rank-one signed matrices (see
Section 2.2), the corresponding convex hull $\mathrm{conv}(\mathcal{A})$ is the cut polytope for which there is
no known tractable characterization. Consequently, one may have to resort to efficiently
computable approximations of $\mathrm{conv}(\mathcal{A})$. The tradeoff in using such approximations in our
atomic norm minimization framework is that we require more measurements for robust recovery. This
section is devoted to providing a better understanding of these issues.

⁴ While Proposition 3.15 follows as a consequence of the general result in Corollary 3.14, one can
remove the constant factor 9 in the statement of Proposition 3.15 by carrying out a more refined
analysis of the Birkhoff polytope.

4.1 Role of Algebraic Structure 

In order to obtain exact or approximate representations (analogous to the cases of the $\ell_1$ norm
and the nuclear norm) it is important to identify properties of the atomic set $\mathcal{A}$ that can be
exploited computationally. We focus on cases in which the set $\mathcal{A}$ has algebraic structure.
Specifically let the ring of multivariate polynomials in $p$ variables be denoted by
$\mathbb{R}[x] = \mathbb{R}[x_1, \ldots, x_p]$. We then consider real algebraic varieties [9]:

Definition 4.1. A real algebraic variety $S \subseteq \mathbb{R}^p$ is the set of real solutions of a system
of polynomial equations:

$$S = \{x : g_j(x) = 0,\ \forall j\},$$

where $\{g_j\}$ is a finite collection of polynomials in $\mathbb{R}[x]$.

Indeed all of the atomic sets $\mathcal{A}$ considered in this paper are examples of algebraic varieties.
Algebraic varieties have the remarkable property that (the closure of) their convex hull can be
arbitrarily well-approximated in a constructive manner as (the projection of) a set defined by
linear matrix inequality constraints [38, 57]. A potential complication may arise, however, if
these semidefinite representations are intractable to compute in polynomial time. In such cases
it is possible to approximate the convex hulls via a hierarchy of tractable semidefinite
relaxations. We describe these results in more detail in Section 4.2. Therefore the atomic norm
minimization problems such as (5) arising in such situations can be solved exactly or
approximately via semidefinite programming.

Algebraic structure also plays a second important role in atomic norm minimization problems.
If an atomic norm $\|\cdot\|_{\mathcal{A}}$ is intractable to compute, we may approximate it via a more
tractable norm $\|\cdot\|_{\mathrm{app}}$. However not every approximation of the atomic norm is equally
good for solving inverse problems. As illustrated in Figure 2 we can construct approximations of
the $\ell_1$ ball that are tight in a metric sense, with
$(1 - \epsilon)\|\cdot\|_{\mathrm{app}} \le \|\cdot\|_{\ell_1} \le (1 + \epsilon)\|\cdot\|_{\mathrm{app}}$, but where the tangent
cones at sparse vectors in the new norm are halfspaces. In such a case, the number of measurements
required to recover the sparse vector ends up being on the same order as the ambient dimension.
(Note that the $\ell_1$-norm is in fact tractable to compute; we simply use it here for illustrative
purposes.) The key property that we seek in approximations to an atomic norm $\|\cdot\|_{\mathcal{A}}$ is
that they preserve algebraic structure such as the vertices/extreme points and more generally the
low-dimensional faces of $\mathrm{conv}(\mathcal{A})$. As discussed in Section 2.5 points on such
low-dimensional faces correspond to simple models, and algebraic-structure preserving
approximations ensure that the tangent cones at simple models with respect to the approximations
are not too much larger than the corresponding tangent cones with respect to the original atomic
norms (see Section 4.3 for a concrete example).

4.2 Semidefinite Relaxations using Theta Bodies 

In this section we give a family of semidefinite relaxations to the atomic norm minimization
problem whenever the atomic set has algebraic structure. To begin with, if we approximate the
atomic norm $\|\cdot\|_{\mathcal{A}}$ by another atomic norm $\|\cdot\|_{\tilde{\mathcal{A}}}$ defined using a larger
collection of atoms $\mathcal{A} \subseteq \tilde{\mathcal{A}}$, it is clear that

$$\|\cdot\|_{\tilde{\mathcal{A}}} \le \|\cdot\|_{\mathcal{A}}.$$

Consequently outer approximations of the atomic set give rise to approximate norms that provide
lower bounds on the optimal value of the problem (5).

Figure 2: The convex body given by the dotted line is a good metric approximation to the $\ell_1$
ball. However as its "corners" are "smoothed out", the tangent cone at $x^\star$ goes from being a
proper cone (with respect to the $\ell_1$ ball) to a halfspace (with respect to the approximation).

In order to provide such lower bounds on the optimal value of (5), we discuss semidefinite
relaxations of the convex hull $\mathrm{conv}(\mathcal{A})$. All our discussion here is based on results
described in [38] for semidefinite relaxations of convex hulls of algebraic varieties using theta
bodies. We only give a brief review of the relevant constructions, and refer the reader to the
vast literature on this subject for more details (see [38, 57] and the references therein). For
subsequent reference in this section, we recall the definition of a polynomial ideal [9, 40]:

Definition 4.2. A polynomial ideal $I \subseteq \mathbb{R}[x]$ is a subset of the ring of polynomials that
contains the zero polynomial (the polynomial that is identically zero), is closed under addition,
and has the property that $f \in I$, $g \in \mathbb{R}[x]$ implies that $f \cdot g \in I$.

To begin with we note that a sum-of-squares (SOS) polynomial in $\mathbb{R}[x]$ is a polynomial that can
be written as the (finite) sum of squares of other polynomials in $\mathbb{R}[x]$. Verifying the nonnegativity
of a multivariate polynomial is intractable in general, and therefore SOS polynomials play an
important role in real algebraic geometry as an SOS polynomial is easily seen to be nonnegative
everywhere. Further, checking whether a polynomial is an SOS polynomial can be accomplished
efficiently via semidefinite programming [57].
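To make the link between SOS certificates and positive semidefinite matrices concrete, here is a minimal Python sketch. The polynomial $p(x, y) = x^2 - 2xy + 2y^2$, the Gram matrix `Q`, and all function names are illustrative assumptions chosen for this note, not objects from the paper. The sketch only verifies a hand-picked certificate; finding such a `Q` in general is the semidefinite feasibility problem referred to above.

```python
# Hypothetical example: p(x, y) = x^2 - 2xy + 2y^2 is SOS because p = z^T Q z
# with z = (x, y) and a positive semidefinite Gram matrix Q.

def p(x, y):
    return x * x - 2 * x * y + 2 * y * y

Q = [[1.0, -1.0],
     [-1.0, 2.0]]

def quad_form(Q, z):
    # Evaluate z^T Q z
    n = len(z)
    return sum(Q[i][j] * z[i] * z[j] for i in range(n) for j in range(n))

def is_psd_2x2(Q):
    # A symmetric 2x2 matrix is PSD iff its diagonal entries and determinant are nonnegative.
    det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]
    return Q[0][0] >= 0 and Q[1][1] >= 0 and det >= 0

def sos_decomposition(x, y):
    # Squares read off from the Cholesky factor of Q: p = (x - y)^2 + y^2
    return (x - y) ** 2 + y ** 2
```

Because `Q` is positive semidefinite, factoring it yields the explicit squares in `sos_decomposition`, which is the reason nonnegativity is immediate once a certificate exists.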

Turning our attention to the description of the convex hull of an algebraic variety, we will
assume for the sake of simplicity that the convex hull is closed. Let $I \subseteq \mathbb{R}[x]$ be a polynomial ideal,
and let $V_{\mathbb{R}}(I) \subseteq \mathbb{R}^p$ be its real algebraic variety:

$$V_{\mathbb{R}}(I) = \{x : f(x) = 0, \ \forall f \in I\}.$$

One can then show that the convex hull conv($V_{\mathbb{R}}(I)$) is given as:

$$\begin{aligned}
\mathrm{conv}(V_{\mathbb{R}}(I)) &= \{x : f(x) \ge 0, \ \forall f \text{ linear and nonnegative on } V_{\mathbb{R}}(I)\} \\
&= \{x : f(x) \ge 0, \ \forall f \text{ linear s.t. } f = h + g, \ \forall h \text{ nonnegative}, \ \forall g \in I\} \\
&= \{x : f(x) \ge 0, \ \forall f \text{ linear s.t. } f \text{ nonnegative modulo } I\}.
\end{aligned}$$

A linear polynomial here is one that has a maximum degree of one, and the meaning of "modulo
an ideal" is clear. As nonnegativity modulo an ideal may be intractable to check, we can consider
a relaxation to a polynomial being SOS modulo an ideal, i.e., a polynomial that can be written
as $\sum_{i=1}^{q} h_i^2 + g$ for $g$ in the ideal. Since it is tractable to check via semidefinite programming
whether bounded-degree polynomials are SOS, the $k$-th theta body of an ideal $I$ is defined as follows
in [38]:

$$\mathrm{TH}_k(I) = \{x : f(x) \ge 0, \ \forall f \text{ linear s.t. } f \text{ is } k\text{-sos modulo } I\}.$$

Here $k$-sos refers to an SOS polynomial in which the components in the SOS decomposition have
degree at most $k$. The $k$-th theta body $\mathrm{TH}_k(I)$ is a convex relaxation of conv($V_{\mathbb{R}}(I)$), and one can
verify that

$$\mathrm{conv}(V_{\mathbb{R}}(I)) \subseteq \cdots \subseteq \mathrm{TH}_{k+1}(I) \subseteq \mathrm{TH}_k(I).$$

By the arguments given above (see also [38]) these theta bodies can be described using semidefinite
programs of size polynomial in $k$. Hence by considering theta bodies $\mathrm{TH}_k(I)$ with increasingly
larger $k$, one can obtain a hierarchy of tighter semidefinite relaxations of conv($V_{\mathbb{R}}(I)$). We also
note that in many cases of interest such semidefinite relaxations preserve low-dimensional faces of
the convex hull of a variety, although these properties are not known in general. We will use some
of these properties below when discussing approximations of the cut polytope.

Approximating tensor norms. We conclude this section with an example application of
these relaxations to the problem of approximating the tensor nuclear norm. We focus on the case
of tensors of order three that lie in $\mathbb{R}^{m \times m \times m}$, i.e., tensors indexed by three numbers, for notational
simplicity, although our discussion is applicable more generally. In particular the atomic set $\mathcal{A}$ is
the set of unit-Euclidean-norm rank-one tensors:

$$\begin{aligned}
\mathcal{A} &= \{u \otimes v \otimes w : u, v, w \in \mathbb{R}^m, \ \|u\| = \|v\| = \|w\| = 1\} \\
&= \{N \in \mathbb{R}^{m^3} : N = u \otimes v \otimes w, \ u, v, w \in \mathbb{R}^m, \ \|u\| = \|v\| = \|w\| = 1\},
\end{aligned}$$

where $u \otimes v \otimes w$ is the tensor product of three vectors. Note that the second description is written
as the projection onto $\mathbb{R}^{m^3}$ of a variety defined in $\mathbb{R}^{m^3+3m}$. The nuclear norm is then given by (2),
and is intractable to compute in general. Now let $I_{\mathcal{A}}$ denote a polynomial ideal of polynomial maps
from $\mathbb{R}^{m^3+3m}$ to $\mathbb{R}$:

$$I_{\mathcal{A}} = \Big\{ g : g = \sum_{i,j,k=1}^{m} g_{ijk}\,(N_{ijk} - u_i v_j w_k) + g_u\,(u^T u - 1) + g_v\,(v^T v - 1) + g_w\,(w^T w - 1), \ \forall g_{ijk}, g_u, g_v, g_w \Big\}.$$

Here $g_u, g_v, g_w, \{g_{ijk}\}_{i,j,k}$ are polynomials in the variables $N, u, v, w$. Following the program de-
scribed above for constructing approximations, a family of semidefinite relaxations to the tensor
nuclear norm ball can be prescribed in this manner via the theta bodies $\mathrm{TH}_k(I_{\mathcal{A}})$.
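As a small sanity check on the algebraic description of the atoms, the following Python sketch builds one rank-one tensor and evaluates the polynomial equations that cut out the lifted variety: the entrywise equations $N_{ijk} = u_i v_j w_k$ and the three unit-norm constraints. The helper names and the particular vectors are hypothetical choices made for illustration.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(t * t for t in v))
    return [t / norm for t in v]

def rank_one_tensor(u, v, w):
    # N[i][j][k] = u_i * v_j * w_k
    return [[[ui * vj * wk for wk in w] for vj in v] for ui in u]

def defining_polynomials(N, u, v, w):
    # Values of N_ijk - u_i v_j w_k and of the unit-norm constraints at (N, u, v, w);
    # all of them vanish exactly on the lifted variety.
    vals = [N[i][j][k] - u[i] * v[j] * w[k]
            for i in range(len(u)) for j in range(len(v)) for k in range(len(w))]
    vals += [sum(t * t for t in x) - 1.0 for x in (u, v, w)]
    return vals

def frobenius_norm(N):
    return math.sqrt(sum(x * x for plane in N for row in plane for x in row))

u = normalize([1.0, 2.0, 2.0])
v = normalize([0.0, 3.0, 4.0])
w = normalize([1.0, 0.0, 0.0])
N = rank_one_tensor(u, v, w)
```

Projecting away `u`, `v`, `w` and keeping only `N` recovers the atom itself, which has unit Euclidean norm.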

4.3 Tradeoff between Relaxation and Number of Measurements 

As discussed in Section 2.5, the atomic norm is the best convex heuristic for solving ill-posed linear
inverse problems of the type considered in this paper. However we may wish to approximate the 
atomic norm in cases when it is intractable to compute exactly, and the discussion in the preceding 
section provides one approach to constructing a family of relaxations. As one might expect the 
tradeoff for using such approximations, i.e., a weaker convex heuristic than the atomic norm, is an 
increase in the number of measurements required for exact or robust recovery. The reason for this 
is that the approximate norms have larger tangent cones at their extreme points, which makes it 
harder to satisfy the empty intersection condition of Proposition 2.1. We highlight this tradeoff
here with an illustrative example involving the cut polytope. 

The cut polytope is defined as the convex hull of all cut matrices: 

$$\mathcal{P} = \mathrm{conv}\{zz^T : z \in \{-1,+1\}^m\}.$$

Figure 3: A toy sketch illustrating the cut polytope $\mathcal{P}$, and the two approximations $\mathcal{P}_1$ and $\mathcal{P}_2$.
Note that $\mathcal{P}_1$ is a sketch of the standard semidefinite relaxation that has the same vertices as $\mathcal{P}$.
On the other hand $\mathcal{P}_2$ is a polyhedral approximation to $\mathcal{P}$ that has many more vertices as shown
in this sketch.



As described in Section 2.2, low-rank matrices that are composed of ±1's as entries are of interest in
collaborative filtering [66], and the norm induced by the cut polytope is a potential convex heuristic
for recovering such matrices from limited measurements. However it is well-known that the cut
polytope is intractable to characterize [25], and therefore we need to use tractable relaxations
instead. We consider the following two relaxations of the cut polytope. The first is the popular
relaxation that is used in semidefinite approximations of the MAXCUT problem:

$$\mathcal{P}_1 = \{M : M \text{ symmetric}, \ M \succeq 0, \ M_{ii} = 1, \ \forall i = 1, \dots, m\}.$$

This is the well-studied elliptope [25], and can be interpreted as the second theta body relaxation
(see Section 4.2) of the cut polytope $\mathcal{P}$ [38]. We also investigate the performance of a second,
weaker relaxation:

$$\mathcal{P}_2 = \{M : M \text{ symmetric}, \ M_{ii} = 1, \ \forall i, \ |M_{ij}| \le 1, \ \forall i \ne j\}.$$

This polytope is simply the convex hull of symmetric matrices with ±1's in the off-diagonal entries,
and 1's on the diagonal. We note that $\mathcal{P}_2$ is an extremely weak relaxation of $\mathcal{P}$, but we use it here
only for illustrative purposes. It is easily seen that

$$\mathcal{P} \subset \mathcal{P}_1 \subset \mathcal{P}_2,$$

with all the inclusions being strict. Figure 3 gives a toy sketch that highlights all the main geometric
aspects of these relaxations. In particular $\mathcal{P}_1$ has many more extreme points than $\mathcal{P}$, although the
set of vertices of $\mathcal{P}_1$, i.e., points that have full-dimensional normal cones, are precisely the cut
matrices (which are the vertices of $\mathcal{P}$) [25]. The convex polytope $\mathcal{P}_2$ contains many more vertices
compared to $\mathcal{P}$ as shown in Figure 3. As expected the tangent cones at vertices of $\mathcal{P}$ become
increasingly larger as we use successively weaker relaxations. The following result summarizes the
number of random measurements required for recovering a cut matrix, i.e., a rank-one sign matrix,
using the norms induced by each of these convex bodies.
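The nesting of the two relaxations can be checked on small matrices. The sketch below, in Python with hypothetical helper names, tests membership in the elliptope (symmetric, unit diagonal, positive semidefinite) and in the box-like polytope (symmetric, unit diagonal, entries bounded by one in magnitude). The positive semidefiniteness test is an assumption of convenience: it attempts a Cholesky factorization of $M + \epsilon I$ for a small $\epsilon$, which is adequate for tiny examples but is not a production-grade numerical test.

```python
import math

def cut_matrix(z):
    # Rank-one sign matrix z z^T for z in {-1, +1}^m
    return [[zi * zj for zj in z] for zi in z]

def symmetric_unit_diagonal(M, tol=1e-9):
    m = len(M)
    return (all(abs(M[i][j] - M[j][i]) <= tol for i in range(m) for j in range(m))
            and all(abs(M[i][i] - 1.0) <= tol for i in range(m)))

def is_psd(M, eps=1e-9):
    # Cholesky factorization of M + eps*I succeeds iff all eigenvalues of M exceed -eps.
    m = len(M)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = M[i][j] + (eps if i == j else 0.0)
            s -= sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

def in_elliptope(M):
    # Membership in the semidefinite relaxation
    return symmetric_unit_diagonal(M) and is_psd(M)

def in_box_relaxation(M, tol=1e-9):
    # Membership in the weaker polyhedral relaxation
    m = len(M)
    return (symmetric_unit_diagonal(M, tol)
            and all(abs(M[i][j]) <= 1.0 + tol for i in range(m) for j in range(m)))

C = cut_matrix([1, -1, 1])
W = [[1.0, -1.0, -1.0], [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]
```

Here `C` is a cut matrix and lies in both relaxations, while `W` is a vertex of the polyhedral relaxation with a negative eigenvalue, so it lies outside the elliptope. This illustrates why the second inclusion is strict.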

Proposition 4.3. Suppose $x^\star \in \mathbb{R}^{m \times m}$ is a rank-one sign matrix, i.e., a cut matrix, and we are
given $n$ random Gaussian measurements of $x^\star$. We wish to recover $x^\star$ by solving a convex program
based on the norms induced by each of $\mathcal{P}$, $\mathcal{P}_1$, $\mathcal{P}_2$. We have exact recovery of $x^\star$ in each of these
cases with high probability under the following conditions on the number of measurements:

1. Using $\mathcal{P}$: $n = O(m)$.

2. Using $\mathcal{P}_1$: $n = O(m)$.

3. Using $\mathcal{P}_2$: $n = \frac{m^2 - m}{4}$.

Proof. For the first part, we note that $\mathcal{P}$ is a symmetric polytope with $2^{m-1}$ vertices. Therefore
we can apply Corollary 3.14 to conclude that $n = O(m)$ measurements suffice for exact recovery.

For the second part we note that the tangent cone at $x^\star$ with respect to the nuclear norm ball
of $m \times m$ matrices contains within it the tangent cone at $x^\star$ with respect to the set $\mathcal{P}_1$. Hence
we appeal to Proposition 3.11 to conclude that $n = O(m)$ measurements suffice for exact recovery.

Finally, we note that $\mathcal{P}_2$ is essentially the hypercube in $\binom{m}{2}$ dimensions. Appealing to Proposi-
tion 3.12 we conclude that $n = \frac{m^2 - m}{4}$ measurements suffice for exact recovery. □
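The count in the last part of the proof is simple arithmetic. The sketch below assumes, as the proof indicates, that the weakest relaxation behaves like a hypercube whose dimension is the number of free off-diagonal entries, and that the hypercube bound is half of that ambient dimension. The function name is hypothetical.

```python
import math

def hypercube_relaxation_measurements(m):
    # Number of free off-diagonal entries of a symmetric m x m matrix with fixed diagonal
    dim = math.comb(m, 2)
    # Assumed hypercube bound: half of the ambient dimension, i.e. (m^2 - m) / 4
    return dim / 2

print(hypercube_relaxation_measurements(20))  # 95.0
```

For $m = 20$ this is 95 measurements, growing quadratically in $m$ in contrast to the linear scaling of the first two cases.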

It is not too hard to show that these bounds are order-optimal, and that they cannot be 
improved. Thus we have a rigorous demonstration in this particular instance of the fact that the 
number of measurements required for exact recovery increases as the relaxations get weaker (and 
as the tangent cones get larger). The principle underlying this illustration holds more generally, 
namely that there exists a tradeoff between the complexity of the convex heuristic and the number 
of measurements required for exact or robust recovery. It would be of interest to quantify this 
tradeoff in other settings, for example, in problems in which we use increasingly tighter relaxations 
of the atomic norm via theta bodies. 

We also note that the tractable relaxation based on $\mathcal{P}_1$ is only off by a constant factor with re-
spect to the optimal heuristic based on the cut polytope $\mathcal{P}$. This suggests the potential for tractable
heuristics to approximate hard atomic norms with provable approximation ratios, akin to meth- 
ods developed in the literature on approximation algorithms for hard combinatorial optimization 
problems. 

4.4 Terracini's Lemma and Lower Bounds on Recovery 

Algebraic structure in the atomic set A also provides a means for computing lower bounds on the 
number of measurements required for exact recovery. The recovery condition of Proposition 2.1
states that the nullspace null($\Phi$) of the measurement operator $\Phi : \mathbb{R}^p \to \mathbb{R}^n$ must miss the tangent
cone $T_{\mathcal{A}}(x^\star)$ at the point of interest $x^\star$. Suppose that this tangent cone contains a $q$-dimensional
subspace. It is then clear from straightforward linear algebra arguments that the number of mea-
surements $n$ must exceed $q$. Indeed this bound must hold for any linear measurement scheme.
Thus the dimension of the subspace contained inside the tangent cone (i.e., the dimension of the 
lineality space) provides a simple lower bound on the number of linear measurements. 

In this section we discuss a method to obtain estimates of the dimension of a subspace component 
of the tangent cone. We focus again on the setting in which A is an algebraic variety. Indeed in 
all of the examples of Section 2.2 the atomic set $\mathcal{A}$ is an algebraic variety. In such cases simple
models $x^\star$ formed according to (1) can be viewed as elements of secant varieties.

Definition 4.4. Let $\mathcal{A} \subseteq \mathbb{R}^p$ be an algebraic variety. Then the $k$'th secant variety $\mathcal{A}^k$ is defined as
the union of all affine spaces passing through any $k + 1$ points of $\mathcal{A}$.

Secant varieties and their tangent spaces have been extensively studied in algebraic geometry
[40]. A particular question of interest is to characterize the dimensions of secant varieties and
tangent spaces. In our context, estimates of these dimensions are useful in giving lower bounds on
the number of measurements required for recovery. Specifically we have the following result, which
states that certain linear spaces must lie in the tangent cone at $x^\star$ with respect to conv($\mathcal{A}$):
Proposition 4.5. Let $\mathcal{A} \subset \mathbb{R}^p$ be a smooth variety, and let $T(u, \mathcal{A})$ denote the tangent space at
any $u \in \mathcal{A}$ with respect to $\mathcal{A}$. Suppose $x = \sum_{i=1}^{k} c_i a_i$, $\forall a_i \in \mathcal{A}$, $c_i \ge 0$, such that

$$\|x\|_{\mathcal{A}} = \sum_{i=1}^{k} c_i.$$

Then the tangent cone $T_{\mathcal{A}}(x^\star)$ contains the following linear space:

$$T(a_1, \mathcal{A}) \oplus \cdots \oplus T(a_k, \mathcal{A}) \subset T_{\mathcal{A}}(x^\star),$$

where $\oplus$ denotes the direct sum of subspaces.

Proof. We note that if we perturb $a_1$ slightly to any neighboring $a_1'$ so that $a_1' \in \mathcal{A}$, then the
resulting $x' = c_1 a_1' + \sum_{i=2}^{k} c_i a_i$ is such that $\|x'\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}}$. The proposition follows directly from
this observation. □

This result is applicable, for example, when $\mathcal{A}$ is the variety of rank-one matrices or the variety of
rank-one tensors as these are smooth varieties. By Terracini's lemma [40] from algebraic geometry
the subspace $T(a_1, \mathcal{A}) \oplus \cdots \oplus T(a_k, \mathcal{A})$ is in fact the estimate for the tangent space $T(x, \mathcal{A}^{k-1})$ at
$x$ with respect to the $(k-1)$'th secant variety $\mathcal{A}^{k-1}$:

Proposition 4.6 (Terracini's Lemma). Let $\mathcal{A} \subset \mathbb{R}^p$ be a smooth affine variety, and let $T(u, \mathcal{A})$
denote the tangent space at any $u \in \mathcal{A}$ with respect to $\mathcal{A}$. Suppose $x \in \mathcal{A}^{k-1}$ is a generic point such
that $x = \sum_{i=1}^{k} c_i a_i$, $\forall a_i \in \mathcal{A}$, $c_i \ge 0$. Then the tangent space $T(x, \mathcal{A}^{k-1})$ at $x$ with respect to the
secant variety $\mathcal{A}^{k-1}$ is given by $T(a_1, \mathcal{A}) \oplus \cdots \oplus T(a_k, \mathcal{A})$. Moreover the dimension of $T(x, \mathcal{A}^{k-1})$
is at most (and is expected to be) $\min\{p, (k+1)\dim(\mathcal{A}) + k\}$.

Combining these results we have that estimates of the dimension of the tangent space $T(x, \mathcal{A}^{k-1})$
lead directly to lower bounds on the number of measurements required for recovery. The intuition
here is clear as the number of measurements required must be bounded below by the number of
"degrees of freedom," which is captured by the dimension of the tangent space $T(x, \mathcal{A}^{k-1})$. However
Terracini's lemma provides us with general estimates of the dimension of $T(x, \mathcal{A}^{k-1})$ for generic
points $x$. Therefore we can directly obtain lower bounds on the number of measurements, purely
by considering the dimension of the variety $\mathcal{A}$ and the number of elements from $\mathcal{A}$ used to construct
$x$ (i.e., the order of the secant variety in which $x$ lies). As an example the dimension of the base
variety of normalized order-three tensors in $\mathbb{R}^{m \times m \times m}$ is $3(m - 1)$. Consequently if we were to in
principle solve the tensor nuclear norm minimization problem, we should expect to require at least
$O(km)$ measurements to recover a rank-$k$ tensor.

5 Computational Experiments 
5.1 Algorithmic Considerations 

While a variety of atomic norms can be represented or approximated by linear matrix inequalities, 
these representations do not necessarily translate into practical implementations. Semidefinite pro- 
gramming can be technically solved in polynomial time, but general interior point solvers typically 
only scale to problems with a few hundred variables. For larger scale problems, it is often preferable 
to exploit structure in the atomic set A to develop fast, first-order algorithms. 
A starting point for first-order algorithm design lies in determining the structure of the proximity
operator (or Moreau envelope) associated with the atomic norm,

$$\Pi_{\mathcal{A}}(x; \mu) := \arg\min_{z} \ \tfrac{1}{2}\|z - x\|^2 + \mu \|z\|_{\mathcal{A}}. \qquad (18)$$



Here $\mu$ is some positive parameter. Proximity operators have already been harnessed for fast
algorithms involving the $\ell_1$ norm [34, 20, 21, 39, 71] and the nuclear norm [50, 12, 69] where these
maps can be quickly computed in closed form. For the $\ell_1$ norm, the $i$th component of $\Pi_{\mathcal{A}}(x; \mu)$ is
given by

$$\Pi_{\mathcal{A}}(x; \mu)_i = \begin{cases} x_i + \mu & x_i < -\mu \\ 0 & -\mu \le x_i \le \mu \\ x_i - \mu & x_i > \mu \end{cases} \qquad (19)$$

This is the so-called soft thresholding operator. For the nuclear norm, $\Pi_{\mathcal{A}}$ soft thresholds the
singular values. In either case, the only structure necessary for the cited algorithms to converge
is the convexity of the norm. Indeed, essentially any algorithm developed for $\ell_1$ or nuclear norm
minimization can in principle be adapted for atomic norm minimization. One simply needs to apply
the operator $\Pi_{\mathcal{A}}$ wherever a shrinkage operation was previously applied.
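For the $\ell_1$ case, the soft thresholding operator is short enough to write out directly. The following Python sketch is a plain-list implementation with a hypothetical function name; it is meant only to illustrate the three cases of the operator.

```python
def soft_threshold(x, mu):
    """Componentwise proximity operator of mu * ||.||_1 (soft thresholding)."""
    out = []
    for xi in x:
        if xi < -mu:
            out.append(xi + mu)      # shrink negative entries toward zero
        elif xi > mu:
            out.append(xi - mu)      # shrink positive entries toward zero
        else:
            out.append(0.0)          # entries with |x_i| <= mu are set to zero
    return out

print(soft_threshold([3.0, -0.5, -2.0, 1.0], 1.0))  # [2.0, 0.0, -1.0, 0.0]
```

Entries with magnitude at most $\mu$ are zeroed, which is what produces sparse iterates in the algorithms discussed in this section.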

For a concrete example, suppose $f$ is a smooth function, and consider the optimization problem

$$\min_{x} \ f(x) + \lambda \|x\|_{\mathcal{A}}. \qquad (20)$$

The classical projected gradient method for this problem alternates between taking steps along the
gradient of $f$ and then applying the proximity operator associated with the atomic norm. Explicitly,
the algorithm consists of the iterative procedure

$$x_{k+1} = \Pi_{\mathcal{A}}(x_k - \alpha_k \nabla f(x_k); \ \alpha_k \lambda) \qquad (21)$$

where $\{\alpha_k\}$ is a sequence of positive stepsizes. Under very mild assumptions, this iteration can be
shown to converge to a stationary point of (20) [35]. When $f$ is convex, the returned stationary
point is a globally optimal solution. Recently, Nesterov has described a particular variant of this
algorithm that is guaranteed to converge at a rate no worse than $O(k^{-1})$, where $k$ is the iteration
counter [56]. Moreover, he proposes simple enhancements of the standard iteration to achieve an
$O(k^{-2})$ convergence rate for convex $f$ and a linear rate of convergence for strongly convex $f$.
If we apply the projected gradient method to the regularized inverse problem

$$\min_{x} \ \|\Phi x - y\|^2 + \lambda \|x\|_{\mathcal{A}} \qquad (22)$$

then the algorithm reduces to the straightforward iteration

$$x_{k+1} = \Pi_{\mathcal{A}}(x_k + \alpha_k \Phi^{\dagger}(y - \Phi x_k); \ \alpha_k \lambda). \qquad (23)$$

Here (22) is equivalent to (6) for an appropriately chosen $\lambda > 0$ and is useful for estimation from
noisy measurements.
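The iteration can be sketched in a few lines of Python for the $\ell_1$ norm. This is a sketch under stated assumptions rather than the authors' implementation: it uses the objective $\tfrac{1}{2}\|\Phi x - y\|^2 + \lambda\|x\|_1$ (the factor $\tfrac{1}{2}$ only rescales $\lambda$), a constant stepsize no larger than the inverse of the largest eigenvalue of $\Phi^T\Phi$, plain lists for matrices, and hypothetical function names.

```python
def soft_threshold(x, mu):
    return [xi - mu if xi > mu else xi + mu if xi < -mu else 0.0 for xi in x]

def proximal_gradient_l1(Phi, y, lam, step, iters=200):
    n, p = len(Phi), len(Phi[0])
    x = [0.0] * p
    for _ in range(iters):
        # Residual r = y - Phi x
        r = [y[i] - sum(Phi[i][j] * x[j] for j in range(p)) for i in range(n)]
        # Gradient step on 0.5 * ||Phi x - y||^2, i.e. x + step * Phi^T r
        z = [x[j] + step * sum(Phi[i][j] * r[i] for i in range(n)) for j in range(p)]
        # Shrinkage with threshold step * lam
        x = soft_threshold(z, step * lam)
    return x

Phi = [[2.0, 0.0], [0.0, 1.0]]
y = [4.0, 0.5]
x_hat = proximal_gradient_l1(Phi, y, lam=1.0, step=0.25)
print(x_hat)  # [1.75, 0.0]
```

For this diagonal example the minimizer can be computed by hand coordinate by coordinate, which gives (1.75, 0) and matches the iterates. Swapping `soft_threshold` for another proximity operator adapts the same loop to a different atomic norm.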

The basic (noiseless) atomic norm minimization problem (5) can be solved by minimizing a
sequence of instances of (22) with monotonically decreasing values of $\lambda$. Each subsequent mini-
mization is initialized from the point returned by the previous step. Such an approach corresponds
to the classic Method of Multipliers [6] and has proven effective for solving problems regularized
by the $\ell_1$ norm and for total variation denoising [72, 13].
This discussion demonstrates that when the proximity operator associated with some atomic
set $\mathcal{A}$ can be easily computed, then efficient first-order algorithms are immediate. For novel atomic
norm applications, one can thus focus on algorithms and techniques to compute the associated
proximity operators. We note that, from a computational perspective, it may be easier to compute
the proximity operator via the dual atomic norm. Associated to each proximity operator is the dual
operator

$$\Lambda_{\mathcal{A}}(x; \mu) = \arg\min_{y} \ \tfrac{1}{2}\|y - x\|^2 \quad \text{s.t.} \quad \|y\|_{\mathcal{A}}^* \le \mu. \qquad (24)$$

By an appropriate change of variables, $\Lambda_{\mathcal{A}}$ is nothing more than the projection of $\mu^{-1}x$ onto the
unit ball in the dual atomic norm:

$$\Lambda_{\mathcal{A}}(x; \mu) = \arg\min_{y} \ \tfrac{1}{2}\|y - \mu^{-1}x\|^2 \quad \text{s.t.} \quad \|y\|_{\mathcal{A}}^* \le 1. \qquad (25)$$

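From convex programming duality, we have $x = \Pi_{\mathcal{A}}(x; \mu) + \Lambda_{\mathcal{A}}(x; \mu)$. This can be seen by
observing

$$\begin{aligned}
\min_{z} \ \tfrac{1}{2}\|z - x\|^2 + \mu\|z\|_{\mathcal{A}} &= \min_{z} \max_{\|y\|_{\mathcal{A}}^* \le \mu} \ \tfrac{1}{2}\|z - x\|^2 + \langle y, z \rangle && (26) \\
&= \max_{\|y\|_{\mathcal{A}}^* \le \mu} \min_{z} \ \tfrac{1}{2}\|z - x\|^2 + \langle y, z \rangle && (27) \\
&= \max_{\|y\|_{\mathcal{A}}^* \le \mu} \ -\tfrac{1}{2}\|y - x\|^2 + \tfrac{1}{2}\|x\|^2 && (28)
\end{aligned}$$

In particular, $\Pi_{\mathcal{A}}(x; \mu)$ and $\Lambda_{\mathcal{A}}(x; \mu)$ form a complementary primal-dual pair for this optimization
problem. Hence, we only need to be able to efficiently compute the Euclidean projection onto the dual
norm ball to compute the proximity operator associated with the atomic norm.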
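The decomposition of $x$ into the proximity operator plus the projection onto the dual norm ball of radius $\mu$ can be checked numerically for the $\ell_1$ norm, whose dual norm is the $\ell_\infty$ norm. The Python sketch below uses hypothetical function names and computes the proximity operator only through the projection.

```python
def soft_threshold(x, mu):
    return [xi - mu if xi > mu else xi + mu if xi < -mu else 0.0 for xi in x]

def project_linf_ball(x, mu):
    # Euclidean projection onto {y : max_i |y_i| <= mu}, the dual-norm ball of the l1 norm
    return [max(-mu, min(mu, xi)) for xi in x]

def prox_via_dual_projection(x, mu):
    # Proximity operator recovered as x minus the projection onto the dual ball
    proj = project_linf_ball(x, mu)
    return [xi - pi for xi, pi in zip(x, proj)]

x = [3.0, -0.5, -2.0, 1.0]
print(prox_via_dual_projection(x, 1.0))  # [2.0, 0.0, -1.0, 0.0]
print(soft_threshold(x, 1.0))            # [2.0, 0.0, -1.0, 0.0]
```

The two routes agree, so for a new atomic set it suffices to implement whichever of the two operations is easier.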

Finally, though the proximity operator provides an elegant framework for algorithm generation,
there are many other possible algorithmic approaches that may be employed to take advantage of
the particular structure of an atomic set $\mathcal{A}$. For instance, we can rewrite (24) as

$$\Lambda_{\mathcal{A}}(x; \mu) = \arg\min_{y} \ \tfrac{1}{2}\|y - \mu^{-1}x\|^2 \quad \text{s.t.} \quad \langle y, a \rangle \le 1 \ \ \forall a \in \mathcal{A}. \qquad (29)$$

Suppose we have access to a procedure that, given $z \in \mathbb{R}^n$, can decide whether $\langle z, a \rangle \le 1$ for
all $a \in \mathcal{A}$, or can find a violated constraint where $\langle z, a \rangle > 1$. In this case, we can apply a
cutting plane method or ellipsoid method to solve (24) or (6) [55, 60]. Similarly, if it is simpler
to compute a subgradient of the atomic norm than it is to compute a proximity operator, then
the standard subgradient method [55] can be applied to solve problems of the form (22). Each
computational scheme will have different advantages and drawbacks for specific atomic sets, and
relative effectiveness needs to be evaluated on a case-by-case basis.
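The procedure assumed by a cutting plane method is a separation oracle. For a finite atomic set it can be written by brute force, as in the following Python sketch. The function names and the choice of signed unit vectors as atoms (which generate the $\ell_1$ norm) are illustrative assumptions. For structured infinite atomic sets the loop would be replaced by a problem-specific maximization of $\langle z, a \rangle$ over the atoms.

```python
def separation_oracle(z, atoms, tol=1e-12):
    """Return an atom a with <z, a> > 1 if one exists, otherwise None."""
    worst, worst_val = None, 1.0 + tol
    for a in atoms:
        val = sum(zi * ai for zi, ai in zip(z, a))
        if val > worst_val:
            worst, worst_val = a, val   # keep the most violated constraint
    return worst

def signed_unit_vectors(p):
    # Atoms +e_i and -e_i, whose convex hull is the l1 unit ball
    atoms = []
    for i in range(p):
        for s in (1.0, -1.0):
            a = [0.0] * p
            a[i] = s
            atoms.append(a)
    return atoms

atoms = signed_unit_vectors(3)
print(separation_oracle([0.5, -0.9, 0.2], atoms))  # None
print(separation_oracle([0.5, -1.5, 0.2], atoms))  # [0.0, -1.0, 0.0]
```

Returning the most violated atom, rather than the first one found, is a design choice that tends to give deeper cuts.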

5.2 Simulation Results 

We describe the results of numerical experiments in recovering orthogonal matrices, permutation 
matrices, and rank-one sign matrices (i.e., cut matrices) from random linear measurements by solv- 
ing convex optimization problems. All the atomic norm minimization problems in these experiments 
are solved using a combination of the SDPT3 package [68] and the YALMIP parser [49].

Orthogonal matrices. We consider the recovery of 20 x 20 orthogonal matrices from random 
Gaussian measurements via spectral norm minimization. Specifically we solve the convex program 
(5), with the atomic norm being the spectral norm. Figure 4 gives a plot of the probability of exact
recovery (computed over 50 random trials) versus the number of measurements required.



Figure 4: Plots of the number of measurements available versus the probability of exact recovery
(computed over 50 trials) for various models.



Permutation matrices. We consider the recovery of 20 x 20 permutation matrices from
random Gaussian measurements. We solve the convex program (5), with the atomic norm being
the norm induced by the Birkhoff polytope of 20 x 20 doubly stochastic matrices. Figure 4 gives
a plot of the probability of exact recovery (computed over 50 random trials) versus the number of
measurements required.

Cut matrices. We consider the recovery of 20 x 20 cut matrices from random Gaussian
measurements. As the cut polytope is intractable to characterize, we solve the convex program
(5) with the atomic norm being approximated by the norm induced by the semidefinite relaxation
$\mathcal{P}_1$ described in Section 4.3. Recall that this is the second theta body associated with the convex
hull of cut matrices, and so this experiment verifies that objects can be recovered from theta-body
approximations. Figure 4 gives a plot of the probability of exact recovery (computed over 50
random trials) versus the number of measurements required.

In each of these experiments we see agreement between the observed phase transitions and the
theoretical predictions (Propositions 3.13, 3.15, and 4.3) of the number of measurements required
for exact recovery. In particular note that the phase transition in Figure 4 for the number of
measurements required for recovering an orthogonal matrix is very close to the prediction
$n \approx \frac{3m^2 - m}{4} = 295$ of Proposition 3.13. We refer the reader to [28, 62, 51] for similar phase transition plots
for recovering sparse vectors, low-rank matrices, and signed vectors from random measurements
via convex optimization.
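The predicted location of the phase transition for orthogonal matrices is a one-line computation. The Python sketch below evaluates $(3m^2 - m)/4$ for the matrix size used in the experiments; the function name is a hypothetical label.

```python
def orthogonal_matrix_prediction(m):
    # Predicted number of Gaussian measurements for an m x m orthogonal matrix
    return (3 * m * m - m) / 4

print(orthogonal_matrix_prediction(20))  # 295.0
```

This is roughly three quarters of the ambient dimension $m^2 = 400$, so the savings over direct measurement are a constant factor in this example.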

6 Conclusions and Future Directions 

This manuscript has illustrated that for a fixed set of base atoms, the atomic norm is the best choice 
of a convex regularizer for solving ill-posed inverse problems with the prescribed priors. With this 
in mind, our results in Section 3 and Section 4 outline methods for computing hard limits on the
number of measurements required for recovery from any convex heuristic. Using the calculus of 
Gaussian widths, such bounds can be computed in a relatively straightforward fashion, especially 
if one can appeal to notions of convex duality and symmetry. This computational machinery of 
widths and dimension counting is surprisingly powerful: near-optimal bounds on estimating sparse 
vectors and low-rank matrices from partial information follow from elementary integration. Thus we
expect that our new bounds concerning symmetric, vertex-transitive polytopes are also nearly tight. 
Moreover, algebraic reasoning allowed us to explore the inherent trade-offs between computational 
efficiency and measurement demands. More complicated algorithms for atomic norm regularization 
might extract structure from less information, but approximation algorithms are often sufficient for 
near optimal reconstructions. 

This report serves as a foundation for many new exciting directions in inverse problems, and 
we close our discussion with a description of several natural possibilities for future work: 

Width calculations for more atomic sets. The calculus of Gaussian widths described in 
Section 3 provides the building blocks for computing the Gaussian widths for the application
examples discussed in Section 2. We have not yet exhaustively estimated the widths in all of these
examples, and a thorough cataloging of the measurement demands associated with different prior 
information would provide a more complete understanding of the fundamental limits of solving 
underdetermined inverse problems. Moreover, our list of examples is by no means exhaustive. The 
framework developed in this paper provides a compact and efficient methodology for constructing 
regularizers from very general prior information, and new regularizers can be easily created by 
translating grounded expert knowledge into new atomic norms. 

Recovery bounds for structured measurements. Our recovery results focus on generic mea-
surements because, for a general set $\mathcal{A}$, it does not make sense to delve into specific measurement
ensembles. Particular structures of the measurement matrix $\Phi$ will depend on the application and
the atomic set $\mathcal{A}$. For instance, in compressed sensing, much work focuses on randomly sampled
Fourier coefficients [16] and random Toeplitz and circulant matrices [41, 61]. With low-rank matri-
ces, several authors have investigated reconstruction from a small collection of entries [17]. In all of
these cases, some notion of incoherence plays a crucial role, quantifying the amount of information
garnered from each row of $\Phi$. It would be interesting to explore how to appropriately generalize
notions of incoherence to new applications. Is there a particular definition that is general enough 
to encompass most applications? Or do we need a specialized concept to match the specifics of 
each atomic norm? 

Quantifying the loss due to relaxation. Section 4.3 illustrates how the choice of approxima-
tion of a particular atomic norm can dramatically alter the number of measurements required for
recovery. However, as was the case for vertices of the cut polytope, some relaxations incur only 
a very modest increase in measurement demands. Using techniques similar to those employed in 
the study of semidefinite relaxations of hard combinatorial problems, is it possible to provide a 
more systematic method to estimate the number of measurements required to recover points from 
polynomial-time computable norms? 

Atomic norm decompositions. While the techniques of Section 3 and Section 4 provide bounds
on the estimation of points in low-dimensional secant varieties of atomic sets, they do not provide
a procedure for actually constructing decompositions. That is, we have provided bounds on the
number of measurements required to recover points $x$ of the form

$$x = \sum_{a \in \mathcal{A}} c_a a$$

when the coefficient sequence $\{c_a\}$ is sparse, but we do not provide any methods for actually
recovering $c$ itself. These decompositions are useful, for instance, in actually computing the rank-
one binary vectors optimized in semidefinite relaxations of combinatorial algorithms [36, 54, 2],
or in the computation of tensor decompositions from incomplete data [46]. Is it possible to use
algebraic structure to generate deterministic or randomized algorithms for reconstructing the atoms
that underlie a vector $x$, especially when approximate norms are used?



Large-scale algorithms. Finally, we think that the most fruitful extensions of this work lie in 
a thorough exploration of the empirical performance and efficacy of atomic norms on large-scale 
inverse problems. The proposed algorithms in Section 5 require only the knowledge of the proximity
operator of an atomic norm, or a Euclidean projection operator onto the dual norm ball. Using 
these design principles and the geometry of particular atomic norms should enable the scaling of 
atomic norm techniques to massive data sets. 
A Proof of Proposition 3.6

Proof. First note that the Gaussian width can be upper-bounded as follows:

$$w(C \cap S^{p-1}) \le E_g\Big[\sup_{z \in C \cap B(0,1)} g^T z\Big], \qquad (30)$$

where $B(0, 1)$ denotes the unit Euclidean ball. The expression on the right hand side inside the
expected value can be expressed as the optimal value of the following convex optimization problem
for each $g \in \mathbb{R}^p$:

$$\begin{aligned}
\max_{z} \quad & g^T z \\
\text{s.t.} \quad & z \in C \\
& \|z\|^2 \le 1
\end{aligned} \qquad (31)$$

We now proceed to form the dual problem of (31) by first introducing the Lagrangian

$$L(z, u, \gamma) = g^T z + \gamma(1 - z^T z) - u^T z$$

where $u \in C^*$ and $\gamma \ge 0$ is a scalar. To obtain the dual problem we maximize the Lagrangian with
respect to $z$, which amounts to setting

$$z = \frac{1}{2\gamma}(g - u).$$

Plugging this into the Lagrangian above gives the dual problem

$$\begin{aligned}
\min \quad & \gamma + \frac{1}{4\gamma}\|g - u\|^2 \\
\text{s.t.} \quad & u \in C^* \\
& \gamma \ge 0.
\end{aligned}$$

Solving this optimization problem with respect to $\gamma$ we find that $\gamma = \frac{1}{2}\|g - u\|$, which gives the
dual problem to (31)

$$\begin{aligned}
\min \quad & \|g - u\| \\
\text{s.t.} \quad & u \in C^*
\end{aligned} \qquad (32)$$

Under very mild assumptions about $C$, the optimal value of (32) is equal to that of (31) (for example
as long as $C$ has a non-empty relative interior, strong duality holds). Hence we have derived

$$E_g\Big[\sup_{z \in C \cap B(0,1)} g^T z\Big] = E_g[\mathrm{dist}(g, C^*)]. \qquad (33)$$

This equation combined with the bound (30) gives us the desired result. □
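The identity between the constrained supremum and the distance to the polar cone can be checked by hand for a simple cone. The Python sketch below takes $C$ to be the nonnegative orthant, an assumption made purely for illustration, for which both sides have closed forms: the maximizer is the normalized positive part of $g$, and the polar cone is the nonpositive orthant. Function names are hypothetical.

```python
import math
import random

def sup_over_cone_ball(g):
    # For C the nonnegative orthant, the maximizer of g^T z over z in C, ||z|| <= 1
    # is g_+ / ||g_+||, so the optimal value is ||g_+||.
    pos = [max(gi, 0.0) for gi in g]
    return math.sqrt(sum(t * t for t in pos))

def dist_to_polar(g):
    # The polar cone of the nonnegative orthant is the nonpositive orthant;
    # projecting g onto it keeps min(g_i, 0) in each coordinate.
    proj = [min(gi, 0.0) for gi in g]
    return math.sqrt(sum((gi - pi) ** 2 for gi, pi in zip(g, proj)))

g = [1.0, -2.0, 2.0]
print(sup_over_cone_ball(g), dist_to_polar(g))  # both equal sqrt(5)

# Spot check: no random feasible point beats the closed-form value.
random.seed(0)
for _ in range(1000):
    z = [abs(random.gauss(0.0, 1.0)) for _ in g]
    norm = math.sqrt(sum(t * t for t in z))
    z = [t / norm for t in z]
    assert sum(gi * zi for gi, zi in zip(g, z)) <= sup_over_cone_ball(g) + 1e-12
```

Averaging either quantity over Gaussian draws of `g` then gives the same bound on the width, which is what makes distance-to-polar-cone computations useful.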



B Proof of Theorem 3.9

Proof. We set $\beta = \frac{1}{\Theta}$. First note that if $\beta \ge \frac{1}{4}\exp\{\frac{p}{9}\}$ then the width bound exceeds $\sqrt{p}$, which is
the maximal possible value for the width of $C$. Thus, we will assume throughout that $\beta \le \frac{1}{4}\exp\{\frac{p}{9}\}$.

Using Proposition 3.6 we need to upper bound the expected distance to the polar cone. Let
$g \sim \mathcal{N}(0, I)$ be a normally distributed random vector. Then the norm of $g$ is independent from the
angle of $g$. That is, $\|g\|$ is independent from $g/\|g\|$. Moreover, $g/\|g\|$ is distributed as a uniform
sample on $S^{p-1}$, and $E_g[\|g\|] \le \sqrt{p}$. Thus we have

$$E_g[\mathrm{dist}(g, C^*)] \le E_g\big[\|g\| \cdot \mathrm{dist}(g/\|g\|, C^* \cap S^{p-1})\big] \le \sqrt{p}\, E_u[\mathrm{dist}(u, C^* \cap S^{p-1})] \qquad (34)$$

where $u$ is sampled uniformly on $S^{p-1}$.

To bound the latter quantity, we will use isoperimetry. Suppose $A$ is a subset of $S^{p-1}$ and $B$
is a spherical cap with the same volume as $A$. Let $N(A, r)$ denote the locus of all points in the
sphere of Euclidean distance at most $r$ from the set $A$. Let $\mu$ denote the Haar measure on $S^{p-1}$ and
$\mu(A; r)$ denote the measure of $N(A, r)$. Then spherical isoperimetry states that $\mu(A; r) \ge \mu(B; r)$
for all $r \ge 0$ (see, for example [47, 52]).

Let $B$ now denote a spherical cap with $\mu(B) = \mu(C^* \cap S^{p-1})$. Then we have

$$\begin{aligned}
E_u[\mathrm{dist}(u, C^* \cap S^{p-1})] &= \int_0^\infty P[\mathrm{dist}(u, C^* \cap S^{p-1}) > t]\, dt && (35) \\
&= \int_0^\infty (1 - \mu(C^* \cap S^{p-1}; t))\, dt && (36) \\
&\le \int_0^\infty (1 - \mu(B; t))\, dt && (37)
\end{aligned}$$

where the first equality is the integral form of the expected value and the last inequality follows by
isoperimetry. Hence we can bound the expected distance to the polar cone intersecting the sphere
using only knowledge of the volume of spherical caps on $S^{p-1}$.

To proceed let $v(\varphi)$ denote the volume of a spherical cap subtending a solid angle $\varphi$. An explicit
formula for $v(\varphi)$ is

$$v(\varphi) = z_p^{-1} \int_0^{\varphi} \sin^{p-1}(\vartheta)\, d\vartheta \qquad (38)$$

where $z_p = \int_0^{\pi} \sin^{p-1}(\vartheta)\, d\vartheta$ [44]. Let $\varphi(\beta)$ denote the minimal solid angle of a cap such that $\beta$ copies
of that cap cover $S^{p-1}$. Since the geodesic distance on the sphere is always greater than or equal
to Euclidean distance, if $K$ is a spherical cap subtending $\varphi$ radians, $\mu(K; t) \ge v(\varphi + t)$. Therefore

$$\int_0^\infty (1 - \mu(B; t))\, dt \le \int_0^\infty (1 - v(\varphi(\beta) + t))\, dt. \qquad (39)$$
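The normalized cap volume in (38) is easy to evaluate numerically, which helps build intuition for how quickly caps shrink in high dimension. The Python sketch below follows the formula as written, with the exponent $p - 1$, and uses a composite Simpson rule. The step count and function names are assumptions of this illustration.

```python
import math

def sin_power_integral(a, b, p, steps=2000):
    # Composite Simpson rule (steps must be even) for the integral of sin(t)^(p-1) on [a, b]
    h = (b - a) / steps
    total = math.sin(a) ** (p - 1) + math.sin(b) ** (p - 1)
    for i in range(1, steps):
        t = a + i * h
        total += (4 if i % 2 else 2) * math.sin(t) ** (p - 1)
    return total * h / 3

def cap_volume(phi, p):
    # v(phi) = z_p^{-1} * integral_0^phi sin^{p-1}(t) dt
    z_p = sin_power_integral(0.0, math.pi, p)
    return sin_power_integral(0.0, phi, p) / z_p

for p in (3, 10, 50):
    print(p, cap_volume(math.pi / 4, p))
```

A hemisphere always has normalized volume one half, while the volume of a cap of fixed angle below $\pi/2$ decays rapidly as $p$ grows. This concentration is what the remainder of the proof quantifies.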
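We can proceed to simplify the right-hand-side integral:

$$\begin{aligned}
\int_0^\infty (1 - v(\varphi(\beta) + t))\, dt &= \int_0^{\pi - \varphi(\beta)} (1 - v(\varphi(\beta) + t))\, dt && (40) \\
&= \pi - \varphi(\beta) - \int_0^{\pi - \varphi(\beta)} v(\varphi(\beta) + t)\, dt && (41) \\
&= \pi - \varphi(\beta) - z_p^{-1} \int_0^{\pi - \varphi(\beta)} \int_0^{\varphi(\beta) + t} \sin^{p-1}\vartheta \, d\vartheta\, dt && (42) \\
&= \pi - \varphi(\beta) - z_p^{-1} \int_0^{\pi} \int_{\max(\vartheta - \varphi(\beta), 0)}^{\pi - \varphi(\beta)} \sin^{p-1}\vartheta \, dt\, d\vartheta && (43) \\
&= \pi - \varphi(\beta) - z_p^{-1} \int_0^{\pi} \{\pi - \varphi(\beta) - \max(\vartheta - \varphi(\beta), 0)\} \sin^{p-1}\vartheta \, d\vartheta && (44) \\
&= z_p^{-1} \int_0^{\pi} \max(\vartheta - \varphi(\beta), 0) \sin^{p-1}\vartheta \, d\vartheta && (45) \\
&= z_p^{-1} \int_{\varphi(\beta)}^{\pi} (\vartheta - \varphi(\beta)) \sin^{p-1}\vartheta \, d\vartheta && (46)
\end{aligned}$$

(43) follows by switching the order of integration and the rest of these equalities follow by straight-
forward integration and some algebra.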

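Using the inequalities $z_p \geq \frac{2}{\sqrt{p-1}}$ and $\sin(x) \leq \exp(-(x - \pi/2)^2/2)$ for $x \in [0, \pi]$,
we can bound the last integral as

\[ z_p^{-1} \int_{\varphi(\beta)}^{\pi} (\vartheta-\varphi(\beta))\,\sin^{p-1}\vartheta\,d\vartheta \leq \frac{\sqrt{p-1}}{2} \int_{\varphi(\beta)}^{\pi} (\vartheta-\varphi(\beta)) \exp\left(-\frac{p-1}{2}\left(\vartheta - \frac{\pi}{2}\right)^2\right) d\vartheta \tag{47} \]

Performing the change of variables $a = \sqrt{p-1}\left(\vartheta - \frac{\pi}{2}\right)$, we are left with the integral

\begin{align}
& \frac{1}{2} \int_{\sqrt{p-1}(\varphi(\beta)-\pi/2)}^{\sqrt{p-1}\,\pi/2} \left\{ \frac{a}{\sqrt{p-1}} + \left(\frac{\pi}{2} - \varphi(\beta)\right) \right\} \exp\left(-\frac{a^2}{2}\right) da \tag{48} \\
&= -\frac{1}{2\sqrt{p-1}} \exp\left(-\frac{a^2}{2}\right) \Bigg|_{\sqrt{p-1}(\varphi(\beta)-\pi/2)}^{\sqrt{p-1}\,\pi/2} + \frac{\frac{\pi}{2} - \varphi(\beta)}{2} \int_{\sqrt{p-1}(\varphi(\beta)-\pi/2)}^{\sqrt{p-1}\,\pi/2} \exp\left(-\frac{a^2}{2}\right) da \tag{49} \\
&\leq \frac{1}{2\sqrt{p-1}} \exp\left(-\frac{p-1}{2}\left(\frac{\pi}{2} - \varphi(\beta)\right)^2\right) + \sqrt{\frac{\pi}{2}} \left(\frac{\pi}{2} - \varphi(\beta)\right) \tag{50}
\end{align}

In this final bound, we bounded the first term by dropping the contribution of the upper limit of integration, and for the second
term we used the fact that

\[ \int_{-\infty}^{\infty} \exp(-x^2/2)\,dx = \sqrt{2\pi} . \tag{51} \]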
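The Gaussian envelope behind (47), $\sin^{p-1}(x) \leq \exp(-(p-1)(x - \pi/2)^2/2)$ on $[0, \pi]$, is easy to probe on a grid. The sketch below is only an illustration under the assumption that $p \geq 2$ is an integer; the function name `gaussian_envelope_gap` is hypothetical. It returns the largest observed value of the left side minus the right side, which should be non-positive up to floating-point rounding, with equality at $x = \pi/2$.

```python
import math

def gaussian_envelope_gap(p, n=10001):
    """Largest value of sin(x)^(p-1) - exp(-(p-1)(x - pi/2)^2 / 2) on a grid over [0, pi]."""
    worst = -math.inf
    for k in range(n):
        x = math.pi * k / (n - 1)
        # Clamp tiny negative rounding of sin near x = pi.
        s = max(0.0, math.sin(x))
        envelope = math.exp(-(p - 1) * (x - math.pi / 2) ** 2 / 2)
        worst = max(worst, s ** (p - 1) - envelope)
    return worst
```

A grid check of this kind is not a proof, but it quickly catches sign or exponent slips when transcribing the bound.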

We are now left with the task of computing a lower bound for $\varphi(\beta)$. We first need to reparameterize
the problem. Let $K$ be a spherical cap. Without loss of generality, we may assume
that

\[ K = \{x \in S^{p-1} : x_1 \geq h\} \tag{52} \]

for some $h \in [0, 1]$. Here $h$ is the height of the cap over the equator. Via elementary trigonometry, the
solid angle that $K$ subtends is given by $\pi/2 - \sin^{-1}(h)$. Hence, if $h(\beta)$ is the largest number such
that $\beta$ caps of height $h$ cover $S^{p-1}$, then $h(\beta) = \sin(\pi/2 - \varphi(\beta))$.

The quantity $h(\beta)$ may be estimated using the following estimate from [11]. For $h \in [0, 1]$, let
$\gamma(p, h)$ denote the volume of a spherical cap of $S^{p-1}$ of height $h$.
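Lemma B.1 ([11]). For $1 \geq h \geq \frac{2}{\sqrt{p}}$,

\[ \frac{1}{10 h \sqrt{p}} (1 - h^2)^{\frac{p-1}{2}} \leq \gamma(p, h) \leq \frac{1}{2 h \sqrt{p}} (1 - h^2)^{\frac{p-1}{2}} . \tag{53} \]

Note that for $h \geq \frac{2}{\sqrt{p}}$,

\[ \frac{1}{2 h \sqrt{p}} (1 - h^2)^{\frac{p-1}{2}} \leq \frac{1}{4} (1 - h^2)^{\frac{p-1}{2}} \leq \frac{1}{4} \exp\left(-\frac{p-1}{2} h^2\right) . \tag{54} \]

So if

\[ h = \sqrt{\frac{2 \log(4\beta)}{p-1}} \tag{55} \]

then $h \leq 1$ because we have assumed $\beta \leq \frac{1}{4} \exp\{\frac{p}{9}\}$ and $p \geq 9$. Moreover, $h \geq \frac{2}{\sqrt{p}}$ and the volume
of the cap with height $h$ is less than or equal to $1/\beta$. That is,

\[ \varphi(\beta) \geq \frac{\pi}{2} - \sin^{-1}\left(\sqrt{\frac{2 \log(4\beta)}{p-1}}\right) . \tag{56} \]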

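Combining the estimate (50) with Proposition 3.6 and using our estimate for $\varphi(\beta)$, we get the
bound

\[ w(C) \leq \frac{1}{2} \sqrt{\frac{p}{p-1}} \exp\left(-\frac{p-1}{2} \sin^{-1}\left(\sqrt{\frac{2 \log(4\beta)}{p-1}}\right)^2\right) + \sqrt{\frac{\pi p}{2}}\, \sin^{-1}\left(\sqrt{\frac{2 \log(4\beta)}{p-1}}\right) . \tag{57} \]

This expression can be simplified by using the following bounds. First, $\sin^{-1}(x) \geq x$ lets us upper
bound the first term by $\sqrt{\frac{p}{p-1}} \frac{1}{8\beta}$. For the second term, using the inequality $\sin^{-1}(x) \leq \frac{\pi}{2} x$ results
in the upper bound

\[ \frac{\pi}{2} \sqrt{\frac{\pi p \log(4\beta)}{p-1}} . \tag{58} \]

For $p \geq 9$ the upper bound can be expressed simply as $w(C) \leq 3\sqrt{\log(4\beta)}$. We recall that $\beta = \frac{1}{\Theta}$,
which completes the proof of the theorem. □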



C Direct Width Calculations 

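We first give the proof of Proposition 3.10.

Proof. Let $x^*$ be an $s$-sparse vector in $\mathbb{R}^p$ with $\ell_1$ norm equal to 1, and let $\mathcal{A}$ denote the set of
unit-Euclidean-norm one-sparse vectors. Let $\Delta$ denote the set of coordinates where $x^*$ is non-zero.
The normal cone at $x^*$ with respect to the $\ell_1$ ball is given by

\begin{align}
N_{\mathcal{A}}(x^*) &= \mathrm{cone}\{z \in \mathbb{R}^p : z_i = \mathrm{sgn}(x_i^*) \text{ for } i \in \Delta,\ |z_i| \leq 1 \text{ for } i \in \Delta^c\} \tag{59} \\
&= \{z \in \mathbb{R}^p : z_i = t\,\mathrm{sgn}(x_i^*) \text{ for } i \in \Delta,\ |z_i| \leq t \text{ for } i \in \Delta^c \text{ for some } t > 0\} . \tag{60}
\end{align}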
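Here $\Delta^c$ represents the zero entries of $x^*$. The minimum squared distance to the normal cone at
$x^*$ can be formulated as a one-dimensional convex optimization problem for arbitrary $z \in \mathbb{R}^p$:

\begin{align}
\inf_{u \in N_{\mathcal{A}}(x^*)} \|z - u\|_2^2 &= \inf_{\substack{t \geq 0 \\ |u_i| \leq t,\ i \in \Delta^c}} \sum_{i \in \Delta} (z_i - t\,\mathrm{sgn}(x_i^*))^2 + \sum_{j \in \Delta^c} (z_j - u_j)^2 \tag{61} \\
&= \inf_{t \geq 0} \sum_{i \in \Delta} (z_i - t\,\mathrm{sgn}(x_i^*))^2 + \sum_{j \in \Delta^c} \mathrm{shrink}(z_j, t)^2 \tag{62}
\end{align}

where

\[ \mathrm{shrink}(z, t) = \begin{cases} z + t & z < -t \\ 0 & -t \leq z \leq t \\ z - t & z > t \end{cases} \tag{63} \]

is the $\ell_1$-shrinkage function. Hence, for any fixed $t \geq 0$ independent of $g$, we have

\begin{align}
\mathbb{E}\left[\inf_{u \in N_{\mathcal{A}}(x^*)} \|g - u\|_2^2\right] &\leq \mathbb{E}\left[\sum_{i \in \Delta} (g_i - t\,\mathrm{sgn}(x_i^*))^2 + \sum_{j \in \Delta^c} \mathrm{shrink}(g_j, t)^2\right] \tag{64} \\
&= s(1 + t^2) + \mathbb{E}\left[\sum_{j \in \Delta^c} \mathrm{shrink}(g_j, t)^2\right] . \tag{65}
\end{align}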

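Now we directly integrate the second term, treating each summand individually. For a zero-mean,
unit-variance normal random variable $g$,

\begin{align}
\mathbb{E}\left[\mathrm{shrink}(g, t)^2\right] &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-t} (g + t)^2 \exp(-g^2/2)\,dg + \frac{1}{\sqrt{2\pi}} \int_t^{\infty} (g - t)^2 \exp(-g^2/2)\,dg \tag{66} \\
&= \frac{2}{\sqrt{2\pi}} \int_t^{\infty} (g - t)^2 \exp(-g^2/2)\,dg \tag{67} \\
&= -\frac{2}{\sqrt{2\pi}}\, t \exp(-t^2/2) + \frac{2(1 + t^2)}{\sqrt{2\pi}} \int_t^{\infty} \exp(-g^2/2)\,dg \tag{68} \\
&\leq \frac{2}{\sqrt{2\pi}} \left(-t + \frac{1 + t^2}{t}\right) \exp(-t^2/2) \tag{69} \\
&= \frac{2}{\sqrt{2\pi}}\, \frac{1}{t} \exp(-t^2/2) . \tag{70}
\end{align}

The first simplification follows because the shrink function and the Gaussian distribution are symmetric
about the origin. The second equality follows by integrating by parts. The inequality follows
by a tight bound on the Gaussian Q-function

\[ Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} \exp(-g^2/2)\,dg \leq \frac{1}{\sqrt{2\pi}}\, \frac{1}{x} \exp(-x^2/2) \quad \text{for } x > 0 . \]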
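The steps (66)-(70) can be checked with a few lines of code. This is a minimal sketch with illustrative function names that do not appear in the paper. It implements the shrinkage function (63), the exact value of $\mathbb{E}[\mathrm{shrink}(g,t)^2]$ obtained from (68) by writing the remaining integral as $\sqrt{2\pi}\,Q(t)$, and the upper bound (70). The Q-function is evaluated through the standard identity $Q(x) = \tfrac{1}{2}\mathrm{erfc}(x/\sqrt{2})$.

```python
import math

def shrink(z, t):
    """The l1-shrinkage (soft-thresholding) function of (63)."""
    if z < -t:
        return z + t
    if z > t:
        return z - t
    return 0.0

def gaussian_q(x):
    """Gaussian tail probability Q(x) = P[g > x] for a standard normal g."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def expected_shrink_sq(t):
    """Exact E[shrink(g, t)^2] from (68): 2 * ((1 + t^2) Q(t) - t * pdf(t))."""
    pdf = math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    return 2.0 * ((1.0 + t * t) * gaussian_q(t) - t * pdf)

def shrink_sq_bound(t):
    """Upper bound (70): (2 / sqrt(2 pi)) * exp(-t^2 / 2) / t, valid for t > 0."""
    return 2.0 / math.sqrt(2.0 * math.pi) * math.exp(-t * t / 2.0) / t
```

For moderate values such as `t = 1.0` the exact value sits well below the bound. For large `t` the closed form in `expected_shrink_sq` subtracts two nearly equal numbers, so it should not be trusted far into the tail.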

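Using this bound, we get

\[ \mathbb{E}\left[\inf_{u \in N_{\mathcal{A}}(x^*)} \|g - u\|_2^2\right] \leq s(1 + t^2) + (p - s)\, \frac{2}{\sqrt{2\pi}}\, \frac{1}{t} \exp(-t^2/2) . \tag{71} \]

Setting $t = \sqrt{2 \log(p/s)}$ gives

\begin{align}
\mathbb{E}\left[\inf_{z \in N_{\mathcal{A}}(x^*)} \|g - z\|_2^2\right] &\leq s\left(1 + 2 \log\left(\frac{p}{s}\right)\right) + \frac{s(1 - s/p)}{\pi \sqrt{\log(p/s)}} \tag{72} \\
&\leq 2 s \log(p/s) + \frac{5}{4} s . \tag{73}
\end{align}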
The last inequality follows because

\[ \frac{1 - s/p}{\pi \sqrt{\log(p/s)}} \leq 0.204 < 1/4 \tag{74} \]

whenever $0 \leq s \leq p$.

□ 
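The numerical constant in (74) can be reproduced by a grid search over the ratio $x = s/p$. The sketch below is only an illustration; the names `ratio` and `max_ratio` are hypothetical, and the grid resolution is an arbitrary choice. It evaluates $(1 - x)/(\pi\sqrt{\log(1/x)})$ on the open interval $(0, 1)$ and returns the largest value found.

```python
import math

def ratio(x):
    """(1 - x) / (pi * sqrt(log(1 / x))) for x = s / p strictly between 0 and 1."""
    return (1.0 - x) / (math.pi * math.sqrt(math.log(1.0 / x)))

def max_ratio(n=100000):
    """Largest value of ratio(x) over the grid x = k / n, k = 1, ..., n - 1."""
    return max(ratio(k / n) for k in range(1, n))
```

The maximum is attained at an interior point of the interval, since the ratio tends to zero at both endpoints.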

Next we give the proof of Proposition 3.11.

Proof. Let $x^*$ be an $m_1 \times m_2$ matrix of rank $r$ with singular value decomposition $U \Sigma V^T$, and
let $\mathcal{A}$ denote the set of rank-one unit-Euclidean-norm matrices of size $m_1 \times m_2$. Without loss of
generality, impose the conventions $m_1 \leq m_2$, $\Sigma$ is $r \times r$, $U$ is $m_1 \times r$, $V$ is $m_2 \times r$, and assume the
nuclear norm of $x^*$ is equal to 1.

Let $u_k$ (respectively $v_k$) denote the $k$'th column of $U$ (respectively $V$). It is convenient to
introduce the orthogonal decomposition $\mathbb{R}^{m_1 \times m_2} = \Delta \oplus \Delta^{\perp}$, where $\Delta$ is the linear space spanned
by elements of the form $u_k z^T$ and $y v_k^T$, $1 \leq k \leq r$, where $z$ and $y$ are arbitrary, and $\Delta^{\perp}$ is the
orthogonal complement of $\Delta$. The space $\Delta^{\perp}$ is the subspace of matrices spanned by the family
$(y z^T)$, where $y$ (respectively $z$) is any vector orthogonal to all the columns of $U$ (respectively $V$).
The normal cone of the nuclear norm ball at $x^*$ is given by the cone generated by the subdifferential
at $x^*$:

\begin{align}
N_{\mathcal{A}}(x^*) &= \mathrm{cone}\{U V^T + W \in \mathbb{R}^{m_1 \times m_2} : W^T U = 0,\ W V = 0,\ \|W\|_{\mathcal{A}}^* \leq 1\} \tag{75} \\
&= \{t U V^T + W \in \mathbb{R}^{m_1 \times m_2} : W^T U = 0,\ W V = 0,\ \|W\|_{\mathcal{A}}^* \leq t,\ t \geq 0\} . \tag{76}
\end{align}

Note that here $\|Z\|_{\mathcal{A}}^*$ is the operator norm, equal to the maximum singular value of $Z$ [62].

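Let $G$ be a Gaussian random matrix with i.i.d. entries, each with mean zero and unit variance.
Then the matrix

\[ Z(G) = \|\mathcal{P}_{\Delta^{\perp}}(G)\|\, U V^T + \mathcal{P}_{\Delta^{\perp}}(G) \tag{77} \]

is in the normal cone at $x^*$. We can then compute

\begin{align}
\mathbb{E}\left[\|G - Z(G)\|_F^2\right] &= \mathbb{E}\left[\|\mathcal{P}_{\Delta}(G) - \mathcal{P}_{\Delta}(Z(G))\|_F^2\right] \tag{78} \\
&= \mathbb{E}\left[\|\mathcal{P}_{\Delta}(G)\|_F^2\right] + \mathbb{E}\left[\|\mathcal{P}_{\Delta}(Z(G))\|_F^2\right] \tag{79} \\
&= r(m_1 + m_2 - r) + r\, \mathbb{E}\left[\|\mathcal{P}_{\Delta^{\perp}}(G)\|^2\right] . \tag{80}
\end{align}

Here (79) follows because $\mathcal{P}_{\Delta}(G)$ and $\mathcal{P}_{\Delta^{\perp}}(G)$ are independent. The final line follows because
$\dim(\Delta) = r(m_1 + m_2 - r)$ and the Frobenius (i.e., Euclidean) norm of $U V^T$ is $\|U V^T\|_F = \sqrt{r}$. Due to
the isotropy of Gaussian random matrices, $\mathcal{P}_{\Delta^{\perp}}(G)$ is identically distributed as an $(m_1 - r) \times (m_2 - r)$
matrix with i.i.d. Gaussian entries each with mean zero and variance one. We thus know that

\[ \mathbb{P}\left[\|\mathcal{P}_{\Delta^{\perp}}(G)\| \geq \sqrt{m_1 - r} + \sqrt{m_2 - r} + s\right] \leq \exp(-s^2/2) \tag{81} \]

(see, for example, [22]). To bound the latter expectation, we again use the integral form of the
expected value. Letting $\mu_{\Delta^{\perp}}$ denote the quantity $\sqrt{m_1 - r} + \sqrt{m_2 - r}$, we have

\begin{align}
\mathbb{E}\left[\|\mathcal{P}_{\Delta^{\perp}}(G)\|^2\right] &= \int_0^{\infty} \mathbb{P}\left[\|\mathcal{P}_{\Delta^{\perp}}(G)\|^2 > h\right] dh \tag{82} \\
&\leq \mu_{\Delta^{\perp}}^2 + \int_{\mu_{\Delta^{\perp}}^2}^{\infty} \mathbb{P}\left[\|\mathcal{P}_{\Delta^{\perp}}(G)\|^2 > h\right] dh \tag{83} \\
&\leq \mu_{\Delta^{\perp}}^2 + \int_0^{\infty} \mathbb{P}\left[\|\mathcal{P}_{\Delta^{\perp}}(G)\|^2 > \mu_{\Delta^{\perp}}^2 + t\right] dt \tag{84} \\
&\leq \mu_{\Delta^{\perp}}^2 + \int_0^{\infty} \mathbb{P}\left[\|\mathcal{P}_{\Delta^{\perp}}(G)\| > \mu_{\Delta^{\perp}} + \sqrt{t}\right] dt \tag{85} \\
&\leq \mu_{\Delta^{\perp}}^2 + \int_0^{\infty} \exp(-t/2)\,dt \tag{86} \\
&= \mu_{\Delta^{\perp}}^2 + 2 . \tag{87}
\end{align}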
Using this bound in (80), we get that

\begin{align}
\mathbb{E}\left[\inf_{Z \in N_{\mathcal{A}}(x^*)} \|G - Z\|_F^2\right] &\leq r(m_1 + m_2 - r) + r\left(\sqrt{m_1 - r} + \sqrt{m_2 - r}\right)^2 + 2r \tag{88} \\
&\leq r(m_1 + m_2 - r) + 2r(m_1 + m_2 - 2r) + 2r \tag{89} \\
&\leq 3r(m_1 + m_2 - r) \tag{90}
\end{align}

where the second inequality follows from the fact that $(a + b)^2 \leq 2a^2 + 2b^2$. We conclude that
$3r(m_1 + m_2 - r)$ random measurements are sufficient to recover a rank $r$, $m_1 \times m_2$ matrix using
the nuclear norm heuristic. □
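The algebra from (88) to (90) can be confirmed by brute force over small dimensions. This is a minimal sketch with hypothetical function names. It evaluates the right-hand side of (88) and the simplified bound (90) for given $m_1$, $m_2$ and $r$ with $1 \leq r \leq m_1 \leq m_2$.

```python
import math

def width_sq_bound(m1, m2, r):
    """Right-hand side of (88): r(m1 + m2 - r) + r (sqrt(m1 - r) + sqrt(m2 - r))^2 + 2r."""
    mu = math.sqrt(m1 - r) + math.sqrt(m2 - r)
    return r * (m1 + m2 - r) + r * mu * mu + 2 * r

def simple_bound(m1, m2, r):
    """Simplified bound (90): 3r(m1 + m2 - r)."""
    return 3 * r * (m1 + m2 - r)
```

For instance, `simple_bound(100, 100, 5)` evaluates to 2925, far fewer than the 10000 entries of a $100 \times 100$ matrix.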